Radu Tudor Ionescu

Habilitation Thesis

Knowledge Transfer between Computer Vision, Text Mining and Computational Biology: New Chapters

Department of Computer Science
University of Bucharest
April 2018

To my dear family, friends and colleagues

Abstract

In our previous research, we studied various methods and algorithms that emerged from the basic idea of transferring knowledge between three different domains: computer vision, natural language processing and computational biology. Although, at first sight, these domains seem to be unrelated fields of study, a deeper look reveals that there are many common concepts and, in fact, images, text documents or DNA strings can be processed with similar techniques. As will be shown by the end of this thesis, the concept of treating images and strings in a similar fashion is very fertile for specific applications in computer vision, text mining and computational biology. In fact, one of the state-of-the-art methods for image categorization is inspired by the bag-of-words representation, which is very popular in information retrieval and natural language processing. Indeed, the bag-of-visual-words model, which builds a vocabulary of visual words by clustering local image descriptors extracted from images, has demonstrated impressive levels of performance for image categorization and image retrieval. By adapting string processing techniques to image analysis or the other way around, knowledge from one domain can be transferred to the other. In fact, many breakthrough discoveries have been made by transferring knowledge between different domains. This thesis follows this line of research and presents novel approaches or improved methods that rely on knowledge transfer. First of all, we present a kernel function that is designed to encode the spatial information in an efficient way. The kernel is applied both in object class recognition from images and text categorization by topic and it shows performance improvements compared
to the standard bag-of-words representation and to the spatial pyramid representation. Second, we present an improved version of the bag-of-words model for text classification tasks, which is based on adapting the bag-of-visual-words model. The adaptation consists of replacing the image descriptors useful for recognizing object patterns in images with word embeddings useful for recognizing semantic patterns in text documents. Third, we describe an approach for outlier detection based on eliminating the smaller k-means clusters. The approach is applied in abnormal event detection in video and in word sense disambiguation, showing state-of-the-art results in both tasks. Fourth, we present a new distance measure for strings. Designed to conform to general principles while being better adapted for DNA strings, the new distance improves several state-of-the-art methods for DNA sequence alignment. We also present an adaptation of the distance measure for exemplar gesture recognition in video. Both distances have the same source of inspiration, an image dissimilarity measure based on patches that was itself inspired by another distance for strings, known as rank distance. Fifth, we also turn our focus to unsupervised learning methods. To this end, we present an approach for abnormal event detection that requires no training videos. The approach is based on unmasking, an unsupervised method that was previously used for authorship verification in text documents. We also present an unsupervised algorithm for word sense disambiguation that draws its inspiration from a popular method used in genetics for whole genome sequencing. To summarize, all the contributions presented in this thesis support the concept of treating images, text documents and DNA strings in a similar manner.

Machine learning is currently a vast area of research with applications in a broad range of fields, such as computer vision, bioinformatics, information retrieval,
natural language processing, audio processing, data mining, and many others. Among the variety of state-of-the-art machine learning approaches for such applications are deep learning and similarity-based learning methods. Deep learning is about training neural models that usually have more than a few sequential layers, in an end-to-end fashion. Learning based on similarity refers to the process of learning based on pairwise similarities between the training samples. The similarity-based learning process can be both supervised and unsupervised, and the pairwise relationship can be either a similarity, a dissimilarity, or a distance function. On the one hand, this thesis presents an unsupervised and a supervised method for abnormal event detection in video. Both methods extract appearance features using state-of-the-art deep convolutional neural networks. On the other hand, this thesis studies several similarity-based learning approaches, such as nearest neighbor models, kernel methods and clustering algorithms. A nearest neighbor model based on a novel distance measure for matching temporal sequences is presented in this thesis. It is used for exemplar sign language recognition in video, achieving promising results. Kernel methods are used in several tasks investigated in this thesis. First, a novel kernel for visual word histograms is described. It improves the performance for object recognition in images by encoding spatial information in a very efficient way. Several kernels based on a pyramid representation are also presented. They are used for object class recognition from images, text categorization by topic and sentiment analysis. Among the variety of clustering algorithms, k-means is the most prevalent in this thesis. It is used to build codebooks of visual words for object class recognition as well as codebooks of super word embeddings for text classification. The k-means clustering algorithm is also employed as an outlier detection method which is applied in abnormal event
detection in video as well as word sense disambiguation at the document level. An interesting pattern can already be observed, namely that the machine learning tasks approached in this thesis can be divided into two different areas: computer vision and string processing. While exploring both areas, it is worth mentioning that the studied methods exhibit state-of-the-art performance levels, and some of them have been presented in top-tier conferences such as ACL, EACL or ICCV.

Contents

1 Motivation and Overview
  1.1 Introduction
  1.2 Knowledge Transfer between Image and Text
  1.3 Overview and Organization
  References
2 Similarity-based Learning and Deep Learning
  2.1 Introduction
  2.2 Similarity-based Learning
  2.3 Nearest Neighbor Approach
  2.4 Kernel Methods
    2.4.1 Mathematical Preliminaries
    2.4.2 Overview of Kernel Classifiers
    2.4.3 Kernel Functions
    2.4.4 Kernel Normalization
    2.4.5 Generic Kernel Algorithm
    2.4.6 Multiple Kernel Learning
  2.5 Cluster Analysis
    2.5.1 K-Means Clustering
    2.5.2 Hierarchical Clustering
  2.6 Deep Learning
    2.6.1 Convolutional Neural Networks
  References

I Knowledge Transfer from Text Mining and Computational Biology to Computer Vision

3 Object Recognition using the Spatial Non-Alignment Kernel
  3.1 Introduction
  3.2 Related Work
    3.2.1 Encoding Spatial Information
  3.3 Bag-of-Visual-Words Model
  3.4 Spatial Non-Alignment Kernel
    3.4.1 Translation and Size Invariance
  3.5 Object Recognition Experiments
    3.5.1 Data Sets Description
    3.5.2 Implementation and Evaluation Procedure
    3.5.3 Parameter Tuning
    3.5.4 Results on Pascal VOC Experiment
    3.5.5 Results on Birds Experiment
  3.6 Discussion
  References
4 Gesture Recognition using Local Frame Match Distance
  4.1 Introduction
  4.2 Related Work
  4.3 Method
    4.3.1 Feature Representation of Hand Gestures
    4.3.2 Local Frame Match Distance
    4.3.3 Learning Methods
  4.4 Experiments and Results
    4.4.1 Data Set
    4.4.2 Evaluation
    4.4.3 Results
  4.5 Discussion
  References
5 Abnormal Event Detection using Unmasking and Narrowed Motion Clusters
  5.1 Introduction
  5.2 Related Work
  5.3 Unsupervised Method
    5.3.1 Features
    5.3.2 Change Detection by Unmasking
  5.4 Supervised Method
    5.4.1 Feature Extraction
    5.4.2 Two-Stage Outlier Detection
  5.5 Experiments
    5.5.1 Data Sets
    5.5.2 Evaluation
    5.5.3 Implementation Details
    5.5.4 Results on the Avenue Data Set
    5.5.5 Results on the Subway Data Set
    5.5.6 Results on the UCSD Data Set
    5.5.7 Results on the UMN Data Set
  5.6 Discussion
  References

II Knowledge Transfer from Computer Vision to Text Mining and Computational Biology

6 Sequence Alignment using Local Rank Distance
  6.1 Introduction
  6.2 Local Rank Distance Definition
  6.3 Local Rank Distance Algorithm
  6.4 Local Rank Distance Sequence Aligners
    6.4.1 Indexing Strategies and Efficiency Improvements
  6.5 Experiments and Results
    6.5.1 Data Sets Description
    6.5.2 Alignment in the Presence of Contaminated Reads
    6.5.3 Clustering an Unknown Organism
    6.5.4 Time Evaluation of Sequence Aligners
    6.5.5 Experiment on Vibrio Species
  6.6 Discussion
  References
7 Text Categorization by Topic using Spatial Information
  7.1 Introduction
  7.2 Related Work
  7.3 Methods to Encode Spatial Information
    7.3.1 Spatial Pyramid for Text
    7.3.2 Spatial Non-Alignment Kernel for Text
  7.4 Experiments
    7.4.1 Data Sets Description
    7.4.2 Implementation Choices
    7.4.3 Evaluation Procedure
    7.4.4 Results on Reuters-21578 Corpus
    7.4.5 Results on 20 Newsgroups
  7.5 Discussion
  References
8 Text Classification using Bag-of-Super-Word-Embeddings
  8.1 Introduction
  8.2 Related Work
    8.2.1 Bag-of-Visual-Words
    8.2.2 Word Embeddings
  8.3 Bag-of-Super-Word-Embeddings
    8.3.1 Implementation Details
    8.3.2 Combination with String Kernels
  8.4 Polarity Classification Experiments
    8.4.1 Data Set
    8.4.2 Baselines
    8.4.3 Results
  8.5 Text Categorization Experiments
    8.5.1 Data Set
    8.5.2 Baseline
    8.5.3 Evaluation Procedure
    8.5.4 Results
  8.6 Automatic Essay Scoring Experiments
    8.6.1 Data Set
    8.6.2 Evaluation Procedure
    8.6.3 Baselines
    8.6.4 Implementation Choices
    8.6.5 In-Domain Results
    8.6.6 Cross-Domain Results
  8.7 Discussion
  References
9 Word Sense Disambiguation using ShotgunWSD
  9.1 Introduction
  9.2 Related Work
  9.3 Method
    9.3.1 Semantic Relatedness
      9.3.1.1 Extended Lesk Measure
      9.3.1.2 Sense Embeddings
      9.3.1.3 Sense Embeddings after Outlier Removal
  9.4 Experiments and Results
    9.4.1 Data Sets
    9.4.2 Parameter Tuning
    9.4.3 Results on SemEval 2007
    9.4.4 Results on Senseval-2
    9.4.5 Results on Senseval-3
    9.4.6 Results on SemEval 2015
  9.5 Discussion
  References
10 Conclusions and Future Work
  10.1 Discussion and Conclusions
  10.2 Future Work
  References
List of Figures
List of Tables

Chapter 1

Motivation and Overview

Abstract

This chapter gives a brief overview of machine learning and related fields of study. The concept of treating image and text data in a similar fashion is then presented. A few successful examples of knowledge transfer between computer vision and text mining are also discussed. The chapter ends with a full overview of the organization of this thesis.

1.1 Introduction

Machine learning is a branch of artificial intelligence that studies computer systems that can learn from data. In this context, learning is about recognizing complex patterns and making intelligent decisions based on data. In the early years of artificial intelligence, the idea that human thinking could be rendered logically in a numerical computing
machine emerged, but it was unclear if such a machine could model the complex human brain, until Alan Turing proposed a test to measure its performance in 1950. The Turing test states that a machine exhibits human-level intelligence if a human judge engages in a natural language conversation with the machine and cannot distinguish it from another human. Despite the fact that intelligent machines that can pass the Turing test have not been developed yet, many interesting and useful systems that can learn from data have been proposed since then. One of the first breakthrough intelligent systems was developed in 1952 by Arthur Samuel from IBM. He developed a game-playing program, for checkers, to achieve sufficient skill to challenge a world champion. His program was based on a search tree of the board positions reachable from the current state.

Some of the early intelligent systems were based on decision rules. Such systems are best known as expert systems. The system that is often called the first expert system is ELIZA, which was developed between 1964 and 1966 by Joseph Weizenbaum from MIT. ELIZA simulated a psychotherapist that could interact with a human patient. It was implemented using simple pattern matching techniques like string substitution and canned responses based on keywords. What is interesting to note is that when ELIZA originally appeared, some people actually mistook it for a human.

In parallel with the development of expert systems, other approaches were proposed. In 1957, Frank Rosenblatt invented the perceptron [Rosenblatt, 1957], which is a mathematical model of the neuron. The perceptron is a very simple linear classifier, but it was shown that a powerful model can be created by combining perceptrons into a network. Despite the fact that neural network research went through many years of stagnation, the field was revived when the backpropagation algorithm [Rumelhart et al., 1986] used for training neural networks became widely popular in the artificial
intelligence community. In the early 1990s, the field of machine learning shifted to a more data-driven approach as compared to the more knowledge-driven expert systems, mainly due to the intersection of computer science and statistics. Many of the current machine learning approaches are based on the ideas developed at that time. A complete history of artificial intelligence is presented in [Nilsson, 2010].

Several learning paradigms have been proposed in the context of machine learning. The two most popular ones are supervised and unsupervised learning. Supervised learning refers to the task of building a classifier using labeled training data. The most studied approaches in machine learning are supervised and they include: Support Vector Machines [Cortes & Vapnik, 1995], Naive Bayes classifiers [Manning et al., 2008], neural networks [Bishop, 1995; Krizhevsky et al., 2012; LeCun et al., 2015], Random Forests [Breiman, 2001] and many others [Caruana & Niculescu-Mizil, 2006]. Unsupervised learning refers to the task of finding hidden structure in unlabeled data. The best known form of unsupervised learning is cluster analysis, which aims at clustering objects into groups based on their similarity. Among the other learning paradigms are semi-supervised learning, which combines both labeled and unlabeled data, and reinforcement learning, which learns to take actions in an environment in order to maximize a long-term reward. Depending on the desired outcome of the machine learning algorithm or on the type of training input available for an application, a particular learning paradigm may be more suitable than the others.

Machine learning is currently a vast area of research with applications in a broad range of fields, such as computer vision [Fei-Fei & Perona, 2005; Forsyth & Ponce, 2002; Sebastiani, 2002; Zhang et al., 2007], bioinformatics [Dinu & Ionescu, 2013; Inza et al., 2010; Leslie et al., 2002], information retrieval [Chifu & Ionescu, 2012; Ionescu et al., 2015b; Manning et
al., 2008], natural language processing [Lodhi et al., 2002; Popescu & Grozea, 2012; Sebastiani, 2002] and many others [Ionescu et al., 2015a]. Among the variety of state-of-the-art machine learning approaches for such applications are deep learning [Goodfellow et al., 2016] and similarity-based learning [Chen et al., 2009] methods. On the one hand, this thesis studies similarity-based learning approaches such as nearest neighbor models, kernel methods [Shawe-Taylor & Cristianini, 2004] and clustering algorithms. On the other hand, some of the presented methods are based on deep learning approaches such as convolutional neural networks [Chatfield et al., 2014; Simonyan & Zisserman, 2014]. The studied approaches have interesting applications and exhibit state-of-the-art performance levels in two different areas: computer vision and string processing. It is important to note that, in this thesis, string processing refers to any task that needs to process string data such as text documents, DNA sequences, and so on. This work investigates string processing tasks ranging from genome sequence alignment [Dinu et al., 2014] to automatic essay scoring [Cozma et al., 2018], word sense disambiguation [Butnaru et al., 2017] and text categorization by topic [Butnaru & Ionescu, 2017], from a machine learning perspective. These tasks belong to one of two separate fields, namely text mining or computational biology, but they are gathered under one umbrella called string processing. In a similar manner, we discuss several computer vision tasks, including object recognition [Ionescu & Popescu, 2015], abnormal event detection in video [Ionescu et al., 2017b, 2018] and gesture recognition [Ionescu et al., 2017a]. While all the topics enumerated so far seem to be unrelated, each and every one of them includes at least one concept that is borrowed from the other fields of study covered by this thesis. In the following section, we provide further details about the transfer of knowledge
between domains. Before diving into the next section, it is worth mentioning that the core part of this thesis is mostly based on recently published works by the author, yet it also includes previously unpublished work and results.

1.2 Knowledge Transfer between Image and Text

Nowadays, computer science specialists are faced with the challenge of processing massive amounts of data. The largest part of this data is actually unstructured and semi-structured data, available in the form of text documents, images, audio files, video files and so on. Researchers have developed methods and tools that extract relevant information and support efficient access to unstructured and semi-structured content. Such methods that aim at providing access to information are mainly studied by researchers in machine learning and related fields. In fact, a tremendous amount of effort has been dedicated to this line of research [Agarwal & Roth, 2002; Goodfellow et al., 2016; Lazebnik et al., 2005, 2006; Leung & Malik, 2001; Manning et al., 2008]. In the context of machine learning, the aim is to obtain a good representation of the data that can later be used to build an efficient classifier. In computer vision, image representations are obtained by feature detection and feature extraction. Most of the feature extraction methods are handcrafted by researchers that have a good understanding of the application and vast experience. This is the case of the bag-of-visual-words model [Csurka et al., 2004; Leung & Malik, 2001; Sivic et al., 2005] in computer vision. A different approach is representation learning, which aims at discovering a better representation of the data provided during training. This is the case of deep learning algorithms [Bengio, 2009; Goodfellow et al., 2016; LeCun et al., 2015; Montavon et al., 2012] that aim at discovering multiple levels of representation, or a hierarchy of features. Deep algorithms learn to transform one representation into another, by better disentangling
the factors of variation that explain the observed data. Whether the representation of the data is obtained through a handcrafted method or learned by a fully automatic process, common concepts of treating different kinds of unstructured and semi-structured data, such as image and text, naturally arise. Despite the fact that computer vision and string processing seem to be unrelated fields of study, the concept of treating image and text in a similar fashion has proven to be very fertile for several applications. Furthermore, by adapting string processing techniques to image analysis or the other way around, knowledge from one domain can be transferred to the other.

Figure 1.1: An example in which the context helps to disambiguate an object (kitchen glove), which can easily be mistaken for something else if the rest of the image is not seen. The image belongs to the Pascal VOC 2007 data set. (a) A picture of a kitchen glove. (b) A picture of the same glove with context.

An example of similarity between text and image is discussed next. It refers to word sense disambiguation and object recognition in images. Word sense disambiguation (WSD) is a core research problem in computational linguistics and natural language processing, which has been recognized since the beginning of the scientific interest in machine translation, and in artificial intelligence, in general. WSD is about determining the meaning of a word in a specific context. In fact, all WSD methods use the context to determine the meaning of an ambiguous word, because the entire information about the word sense is contained in the context [Agirre & Edmonds, 2006]. The basic concept is to extract features from the context that could help the WSD process. In a similar fashion, an object in an image can be recognized using the entire image as a context. For example, a method that could detect the presence of a kitchen glove in the image would have to look for distinctive features such as the texture of the material, shape,
and perhaps even color. However, there could be other objects that have similar shape or color, and in more difficult situations, such as illustrated in Figure 1.1(a), it may be almost impossible to distinguish the glove. Thus, a better approach could be to look for other distinctive features in the image provided by the context. For instance, a human can easily figure out that a glove is hanging by a kitchen cabinet knob in the scene illustrated in Figure 1.1(b). It is easier to understand the entire scene as a whole than to take the glove out of context. In conclusion, using the context can help to avoid confusion. Not surprisingly, this intuitive idea has already been studied in the computer vision literature [Galleguillos & Belongie, 2010; Rabinovich et al., 2007]. In [Rabinovich et al., 2007], the semantic context is incorporated into object categorization to reduce ambiguity in objects' visual appearance and improve accuracy. The paper of [Galleguillos & Belongie, 2010] goes even further and makes a distinction between three types of context, namely semantic context, spatial context and scale context.

Another example of treating image and text in a similar manner is a state-of-the-art method for image categorization and image retrieval inspired by the bag-of-words representation, which is very popular in information retrieval and natural language processing. The bag-of-words model represents a text as an unordered collection of words, completely disregarding grammar, word order, and syntactic groups. The bag-of-words model has many applications from information retrieval [Manning et al., 2008] to natural language processing [Manning & Schütze, 1999] and word sense disambiguation [Agirre & Edmonds, 2006; Chifu & Ionescu, 2012]. Words are the atoms that form coherent text. In the context of image analysis, the concept of word needs to be somehow defined, before being able to use a bag-of-words representation. As illustrated in Figure 1.2, small
repetitive patterns can be observed in images; however, these atomic patterns are not identical in all locations in the image, since they are affected by various translations, rotations, illumination changes, noise and so on. In this context, computer vision researchers have defined the concept of visual word as a group of similar (but not necessarily identical) local image patterns in order to cope with possible image transformations. To obtain a vocabulary of visual words, local image descriptors such as SIFT [Lowe, 1999, 2004] are usually vector quantized. The vector quantization process can be done, for example, by k-means clustering [Leung & Malik, 2001] or by probabilistic Latent Semantic Analysis [Sivic et al., 2005]. The frequency of each visual word is then recorded in a histogram which represents the final feature vector for the image. This histogram is the equivalent of the bag-of-words representation for text. The idea of representing images as bag-of-visual-words has demonstrated very good performance for image categorization [Zhang et al., 2007], image retrieval [Philbin et al., 2007] and facial expression recognition [Ionescu et al., 2013].

Figure 1.2: An example of repetitive local image patterns that form the building blocks of the bag-of-visual-words model.

Figure 1.3: An object that can be described by multiple categories such as toy, bear, or both.

One of the most important problems in computer vision is object recognition. Machine learning methods represent the state-of-the-art approach for the object recognition problem. A common approach is to make some assumptions in order to treat object recognition as a classification problem. First, object categories are considered to be fixed and known. Second, each instance belongs to a single category. However, some researchers argue that these assumptions do not adequately describe reality. The following example shows that these assumptions do not always hold. The object presented in Figure 1.3 can be
described either as a toy, a bear, or both. It is clear that the object does not belong to a single category. Furthermore, the category of the object might be irrelevant for particular applications. Another drawback of this approach is that it misses out some of the subtle aspects of object recognition. For example, an object classification system does not understand the properties of an object and it cannot deal with unfamiliar objects. In other words, it fails to extract aspects of meaning. Thus, some computer vision researchers have proposed different approaches for the object recognition task.

One alternative approach, proposed in [Duygulu et al., 2002], is to model object recognition as machine translation. The model is based on the observation that object recognition is a little like translation, in that a picture (or text in a source language) goes in, and a description (or text in a target language) comes out. In this model, object recognition becomes a process of annotating image regions with words. First, images are segmented into regions, which are then classified into region types. Next, a mapping between region types and keywords provided with the images is learned. This process is similar to learning a lexicon from data, a standard problem in machine translation literature [Jurafsky & Martin, 2000; Manning & Schütze, 1999]. This interpretation of object recognition has proven fertile. Research in this area has led to the development of other systems, such as the one described in [Farhadi et al., 2010], which generates sentences from images. The system computes a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. To take this even further, the work of [Sadeghi & Farhadi, 2011] suggests that it is easier and more effective to generate descriptions of images in terms of chunks of meaning, such as "a person riding a horse", rather than
individual components, such as "person" or "horse". In this approach, categories are replaced with visual phrases for recognition.

The examples described so far are successful cases of treating image data as text data. However, research that studies how to improve text processing techniques with knowledge from computer vision has also been conducted. A good example is the method introduced in [Barnard & Johnson, 2005], which proposes the use of images for WSD, either alone, or in conjunction with traditional text based methods. To integrate image information with text data, the authors exploit previous work on linking images and words [Barnard et al., 2003; Duygulu et al., 2002]. The empirical results strongly suggest that images can help to disambiguate senses of words.

In recent years, deep neural networks have demonstrated impressive levels of performance in various computer vision tasks [Krizhevsky et al., 2012; LeCun et al., 2015; Simonyan & Zisserman, 2014]. After the success of these models in computer vision, researchers have tried to adapt the deep learning algorithms for text data [Johnson & Zhang, 2015; Mikolov et al., 2013; Sutskever et al., 2014]. One of the most popular approaches is to use neural models in order to build word embeddings [Mikolov et al., 2013] by mapping words from a vocabulary to vectors of real-valued numbers in a low dimensional space. Although deep learning algorithms are now widely used in the NLP community, others have suggested that equally good word embeddings can be produced without the help of deep models, for example, by using Hellinger Principal Component Analysis [Lebret et al., 2013]. Nevertheless, deep learning models, which essentially promote the idea of end-to-end learning, are now extremely popular in both computer vision and natural language processing, which supports the main argument behind the present thesis, namely that knowledge transfer between these domains is fruitful.

The concept of treating image and text in a
similar manner is exploited in some way or another in the previous examples found in literature. Several other examples presented by Ionescu and Popescu [Ionescu & Popescu, 2016a] show that knowledge transfer from one domain to another has proven to be very fertile in the case of computer vision and natural language processing. This thesis follows the same line of research and presents novel approaches or improved methods that rely on this cornerstone concept.

Computer vision researchers have demonstrated that the object recognition performance can be improved by including spatial information into the bag-of-visual-words model. A state-of-the-art approach is the spatial pyramid representation [Lazebnik et al., 2006], which divides the image into spatial bins. In Chapter 3, we present an improvement to the popular bag-of-visual-words model. The improvement consists of employing the Spatial Non-Alignment Kernel to encode spatial information in the similarity of two images. Compared to the widely-used spatial pyramid representation [Lazebnik et al., 2006], it improves performance while consuming less space and time. Not only is the bag-of-visual-words model itself transferred from text analysis, but the Spatial Non-Alignment Kernel is also inspired by rank distance [Dinu & Manea, 2006], a distance measure for strings. In Chapter 7, the spatial pyramid and the Spatial Non-Alignment Kernel are used to significantly improve the performance of the bag-of-words model in the context of text categorization by topic, showing that spatial information can also be useful for text analysis.

In our previous work [Dinu et al., 2012; Ionescu & Popescu, 2013], we introduced a dissimilarity measure for images termed Local Patch Dissimilarity. The dissimilarity measure was inspired by the rank distance measure [Dinu & Manea, 2006]. The main concern was to extend rank distance from one-dimensional input (strings) to two-dimensional input (digital images). While rank distance is a highly
accurate measure for strings, the experiments presented in [Dinu et al., 2012; Ionescu & Popescu, 2013] suggest that the proposed extension of rank distance to images is very accurate for handwritten digit recognition and, with some redesign work, for texture analysis [Ionescu et al., 2014]. Based on the same developments as Local Patch Dissimilarity, we present a novel distance measure for strings in Chapter 6. Designed to conform to more general principles, while being better adapted for specific data types, such as DNA strings or text, it shows interesting results in DNA sequence alignment. In Chapter 4, we present the Local Frame Match Distance, a novel approach for matching gestures inspired by Local Rank Distance. While Local Rank Distance efficiently approximates the non-alignment of character n-grams between two strings, we employ the Local Frame Match Distance to efficiently measure the non-alignment of hand locations between two video sequences. Furthermore, we transform the Local Frame Match Distance into a kernel and use it in combination with Kernel Discriminant Analysis for sign language recognition with exemplars.

In Chapter 5, we describe an unsupervised approach as well as a supervised approach for abnormal event detection in video. The unsupervised approach is based on unmasking [Koppel et al., 2007], a technique previously used for authorship verification in text documents, which has been adapted to the abnormal event detection task. The supervised approach is based on a two-stage algorithm. After extracting motion features from the training video containing only normal events, the algorithm applies k-means clustering to find clusters representing different types of motion. In the first stage, we consider that clusters with fewer samples (with respect to a given threshold) contain only outliers and we eliminate these clusters altogether. In the second stage, we shrink the borders of the remaining clusters by training a one-class Support Vector Machines
model on each cluster. In Chapter 9, we present an unsupervised algorithm for word sense disambiguation at the document level that eliminates outlier word senses using a similar approach. For each sense of a word, we collect all the words from the corresponding WordNet [Fellbaum, 1998; Miller, 1995] synset, gloss and related synsets into a sense bag. We embed the collected words from all the sense bags in a document into a vector space using a common word embedding framework [Mikolov et al., 2013]. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we eliminate clusters with fewer samples, as they are likely to represent outliers. Words from the eliminated clusters are also removed from each and every sense bag. Finally, the algorithm computes the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense.

In Chapter 8, we present a novel approach for text classification based on clustering word embeddings, inspired by the bag-of-visual-words model. After each word in a collection of documents is represented as a word vector using a pre-trained word embeddings model [Mikolov et al., 2013], a k-means algorithm is applied on the word vectors in order to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a super word embedding that embodies all the semantically related word vectors in a certain region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid. In the end, each document is represented as a bag-of-super-word-embeddings by computing the frequency of each super word embedding in the respective document. The bag-of-super-word-embeddings shows its usefulness in three text mining tasks, namely text categorization by topic, polarity classification and automated essay scoring. In the last of these tasks, the bag-of-super-word-embeddings framework is combined
with the intersection string kernel, which is based on character n-grams. Notably, the intersection kernel has successfully been used in computer vision for object class recognition from images [Maji et al., 2008; Vedaldi & Zisserman, 2010].

To summarize, all the contributions presented in this thesis are based on the cornerstone concept of treating image and text in a similar manner. Moreover, there are several other contributions [Barnard & Johnson, 2005; Duygulu et al., 2002; LeCun et al., 2015] that transfer knowledge between computer vision and text mining, but altogether, this concept is far from saturated. When a breakthrough discovery is made in one domain, researchers can always consider adapting and using the respective discovery in another domain, even though their attempt may not necessarily prove to be successful in the end.

1.3 Overview and Organization

The rest of this thesis is organized as follows. All the machine learning methods that are employed to obtain results for different applications in computer vision and string processing are described in Chapter 2. The chapter gives an overview of the main concepts of learning based on similarity as well as deep learning. Specific machine learning methods that are based on these concepts are then presented. First, nearest neighbor models are discussed. An overview of kernel methods is given next, since the state-of-the-art methods consistently used in the supervised learning tasks presented throughout this thesis are kernel methods. Chapter 2 continues with a discussion about cluster analysis. Clustering techniques are used throughout this thesis in various contexts, from building vocabularies of visual words to outlier detection. Chapter 2 ends with a discussion about deep learning, giving special attention to convolutional neural networks. Convolutional neural networks are employed to extract deep appearance features useful for abnormal event detection in video.

The main content of this thesis is organized in
two parts. Part I presents machine learning methods and applications in computer vision that are based on knowledge and concepts borrowed from text mining. Part II presents machine learning methods and applications in text and string processing, or more precisely, in computational biology and text mining. These are based on concepts transferred from computer vision. Chapters 3, 4 and 5 belong to Part I, while Chapters 6, 7, 8 and 9 belong to Part II. Finally, the conclusions are drawn in Chapter 10. The content of each chapter is briefly discussed next.

Chapter 3 presents the bag-of-visual-words model along with some improvements for object recognition in images. For the bag-of-visual-words approach, images are represented as histograms of visual words from a codebook that is usually obtained with a simple clustering method. Next, kernel methods are used to compare such histograms. Researchers have demonstrated that the object recognition performance with the bag-of-visual-words can be improved by including spatial information. A state-of-the-art approach is the spatial pyramid representation [Lazebnik et al., 2006], which divides the image into spatial bins. In Chapter 3, we describe another general approach that encodes the spatial information in a more accurate and more efficient way. The approach is to embed the spatial information into a kernel function termed the Spatial Non-Alignment Kernel (SNAK) [Ionescu & Popescu, 2015]. For each visual word, the average position and the standard deviation are computed based on all the occurrences of the visual word in the image. These are computed with respect to the center of the object, which is determined with the help of the objectness measure [Alexe et al., 2010, 2012]. The pairwise similarity of two images is then computed by taking into account the difference between the average positions and the difference between the standard deviations of each visual word in the two images. In all the experiments, the SNAK framework shows a
better recognition accuracy than the spatial pyramid.

Gesture recognition using a training set of limited size for a large vocabulary of gestures is a challenging problem in computer vision. With few examples per gesture class, researchers often employ state-of-the-art exemplar-based methods such as Dynamic Time Warping (DTW) [Conly et al., 2016]. Chapter 4 presents two contributions in the area of exemplar-based gesture recognition. As an alternative to Dynamic Time Warping, we first present the Local Frame Match Distance (LFMD) [Ionescu et al., 2017a], a novel approach for matching gestures inspired by a distance measure for strings, namely Local Rank Distance [Ionescu, 2013]. While Local Rank Distance efficiently approximates the non-alignment of character n-grams between two strings, we employ the Local Frame Match Distance to efficiently measure the non-alignment of hand locations between two video sequences. Second, we transform the Local Frame Match Distance into a kernel and use it in combination with Kernel Discriminant Analysis for sign language recognition with exemplars. The empirical results indicate that our method can generally yield better performance than a state-of-the-art Dynamic Time Warping approach [Conly et al., 2016] on the challenging task of American Sign Language recognition, while reducing the computational time by 30%.

In Chapter 5, we present two methods for abnormal event detection in video. The first method [Ionescu et al., 2017b] is unsupervised, as it requires no training sequences. The unsupervised method is based on unmasking [Koppel et al., 2007], a technique previously used for authorship verification in text documents, which we adapt to our task. The method iteratively trains a binary classifier to distinguish between two consecutive video sequences while removing at each step the most discriminant features. Higher training accuracy rates of the intermediately obtained classifiers indicate abnormal events. The second method
[Ionescu et al., 2018] is supervised and it approaches the abnormal event detection problem as an outlier detection task. The supervised method is composed of a two-stage algorithm based on k-means clustering and one-class Support Vector Machines (SVM) to eliminate outliers. After extracting motion features from the training video containing only normal events, we apply k-means clustering to find clusters representing different types of motion. In the first stage, we consider that clusters with fewer samples (with respect to a given threshold) contain only outliers and we eliminate these clusters altogether. In the second stage, we shrink the borders of the remaining clusters by training a one-class SVM model on each cluster. To detect abnormal events in the test video, we analyze each test sample and consider its maximum normality score provided by the trained one-class SVM models, based on the intuition that a test sample can belong to only one cluster of normal motion. If the test sample does not fit well in any narrowed cluster, then it is labeled as abnormal. We also combine our approach based on motion features with a recent approach based on deep appearance features [Smeureanu et al., 2017] extracted with pre-trained convolutional neural networks. We combine our two-stage algorithm with the deep framework using a late fusion strategy, keeping the pipelines of the two approaches independent. We compare both methods with several state-of-the-art supervised and unsupervised methods on four benchmark data sets. The empirical results presented in Chapter 5 indicate that the unsupervised abnormal event detection framework can achieve better results than a state-of-the-art unsupervised method [Del Giorno et al., 2016]. At the same time, our supervised method achieves better results than all the compared supervised methods [Cheng et al., 2015; Cong et al., 2011; Hasan et al., 2016; Hinami et al., 2017; Lu et al., 2013; Mehran et al., 2009; Ravanbakhsh et al., 2017; Saligrama & Chen,
2012; Sun et al., 2017; Zhang et al., 2016] in most cases, while processing the test video in real-time at 32 frames per second on CPU.

In Chapter 6, we present Local Rank Distance (LRD), a distance measure that was initially presented in [Ionescu, 2013]. The distance measure is designed to comprise more general principles than rank distance [Dinu & Manea, 2006], but it is also developed with a practical motivation in mind, specifically to be more suitable for DNA strings or text. Chapter 6 describes a fast algorithm [Ionescu, 2015] for computing LRD and presents an application to sequence alignment. Genome sequence alignment refers to the task of assigning a set of short DNA reads to a reference genome. As such, a genome sequence aligner based on LRD [Dinu et al., 2014] is described in Chapter 6. The LRD aligner presented in this thesis prioritizes correctness over speed. However, some indexing strategies to speed up the aligner are also described.

In Chapter 7, two approaches to encode spatial information in the bag-of-visual-words model are transferred to text analysis. These are the spatial pyramid [Lazebnik et al., 2006] and the Spatial Non-Alignment Kernel [Ionescu & Popescu, 2015]. In the context of object recognition from images, the spatial information helps to significantly improve performance [Ionescu & Popescu, 2015; Lazebnik et al., 2006]. The empirical results presented in Chapter 7 indicate that spatial information can also be useful for text categorization by topic. The spatial pyramid for text divides the text into sections (or parts) using multiple levels of granularity and extracts features from each of these sections. The final representation, obtained by concatenating all the features, roughly indicates what features appear in a certain section of a text document, such as the introduction or the conclusion. The Spatial Non-Alignment Kernel for text replaces visual words from images with words from text documents, providing a soft assignment
alternative to the spatial pyramid. As in the case of object recognition from images, the Spatial Non-Alignment Kernel seems to be a better approach in terms of performance, probably because it represents the spatial information (location of words in text) in a more accurate way than the spatial pyramid.

In Chapter 8, we present a novel approach [Butnaru & Ionescu, 2017] for text classification based on clustering word embeddings, inspired by the bag-of-visual-words model. After each word in a collection of documents is represented as a word vector using a pre-trained word embeddings model [Mikolov et al., 2013], a k-means clustering algorithm is applied on the word vectors in order to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a super word embedding that embodies all the semantically related word vectors in a certain region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid. In the end, each document is represented as a bag-of-super-word-embeddings by computing the frequency of each super word embedding in the respective document. We also diverge from the idea of building a single vocabulary for the entire collection of documents, and propose to build class-specific vocabularies for better performance. Using this kind of representation, we report results on two text mining tasks, namely text categorization by topic and polarity classification. On both tasks, our model yields better performance than the standard bag-of-words.

In Chapter 8, we also present an approach [Cozma et al., 2018] based on combining string kernels and word embeddings for automatic essay scoring. String kernels capture the similarity among strings based on counting common character n-grams, which are a low-level yet powerful type of feature demonstrating state-of-the-art results in various text classification tasks such as Arabic dialect identification [Ionescu & Butnaru, 2017; Ionescu &
Popescu, 2016b] or native language identification [Ionescu & Popescu, 2017; Ionescu et al., 2016]. We combine string kernels with the high-level semantic feature representation provided by the bag-of-super-word-embeddings. We evaluate our approach on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings. The empirical results indicate that our approach yields better performance than several state-of-the-art approaches [Dong & Zhang, 2016; Dong et al., 2017; Phandi et al., 2015; Tay et al., 2018].

In Chapter 9, we present a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm, known as ShotgunWSD [Butnaru et al., 2017], is inspired by the Shotgun sequencing technique, which is a broadly-used whole genome sequencing approach. ShotgunWSD performs WSD at the document level based on three phases. The first phase consists of applying a brute-force WSD algorithm on short context windows selected from the document in order to generate a short list of likely sense configurations for each window. The second phase consists of assembling the local sense configurations into longer composite configurations by prefix and suffix matching. In the third phase, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a majority voting scheme that considers only the top configurations in which the respective word appears. In Chapter 9, we also present an improved version (2.0) of ShotgunWSD which is based on a different approach for computing the semantic relatedness score between two word senses, a step that lies at the core of building better local sense configurations. For each sense, we collect all the words from the corresponding WordNet synset, gloss and related synsets into a sense bag. We embed the collected words from all the sense bags in the entire document into a vector space using a common word embedding framework [Mikolov
et al., 2013]. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we consider that clusters with fewer samples represent outliers and we eliminate these clusters altogether. Words from the eliminated clusters are also removed from each and every sense bag. Finally, we compute the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense. We compare the improved ShotgunWSD algorithm (version 2.0) with its previous version (1.0) as well as several state-of-the-art unsupervised WSD algorithms. We demonstrate that ShotgunWSD 2.0 yields better performance on four data sets. Furthermore, our algorithm outperforms the strong Most Common Sense (MCS) baseline on one data set, a remarkable achievement for an unsupervised learning technique.

The conclusions presented in Chapter 10 point to the fact that the concept of treating image and text in a similar way is indeed fertile. In the final chapter, we also provide some general guidelines on future work and discuss new directions that could arise by transferring knowledge between computer vision, text mining and computational biology.

References

Agarwal, Shivani and Roth, Dan. Learning a Sparse Representation for Object Detection. In Proceedings of ECCV, pp. 113–127, Copenhagen, Denmark, June 2002. (cited on 4)

Agirre, Eneko and Edmonds, Philip Glenny. Word Sense Disambiguation: Algorithms and Applications. Springer, 2006. (cited on 5, 6)

Alexe, Bogdan, Deselaers, Thomas, and Ferrari, Vittorio. What is an object?
In Proceedings of CVPR, pp. 73–80, San Francisco, CA, USA, June 2010. IEEE. (cited on 14)

Alexe, Bogdan, Deselaers, Thomas, and Ferrari, Vittorio. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012. (cited on 14)

Barnard, Kobus and Johnson, Matthew. Word sense disambiguation with pictures. Artificial Intelligence, 167(1-2):13–30, September 2005. (cited on 9, 12)

Barnard, Kobus, Duygulu, Pinar, Forsyth, David, de Freitas, Nando, Blei, David M., and Jordan, Michael I. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, March 2003. (cited on 9)

Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. (cited on 4)

Bishop, Christopher M. Neural Networks for Pattern Recognition. Oxford University Press, Inc., New York, NY, USA, 1995. (cited on 2)

Breiman, Leo. Random forests. Machine Learning, 45(1):5–32, October 2001. (cited on 2)

Butnaru, Andrei and Ionescu, Radu Tudor. From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings. In Proceedings of KES, pp. 1784–1793, 2017. (cited on 3, 16)

Butnaru, Andrei, Ionescu, Radu Tudor, and Hristea, Florentina. ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing. In Proceedings of EACL, pp. 916–926, 2017. (cited on 3, 17)

Caruana, Rich and Niculescu-Mizil, Alexandru. An empirical comparison of supervised learning algorithms. In Proceedings of ICML, pp. 161–168, New York, NY, USA, 2006. (cited on 2)

Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. Return of the Devil in the Details: Delving Deep into Convolutional Nets. In Proceedings of BMVC, 2014. (cited on 3)

Chen, Yihua, Garcia, Eric K., Gupta, Maya R., Rahimi, Ali, and Cazzanti, Luca. Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research, 10:747–776, June 2009. (cited on 3)

Cheng, Kai-Wen, Chen, Yie-Tarng,
and Fang, Wen-Hsien. Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression. In Proceedings of CVPR, pp. 2909–2917, 2015. (cited on 15)

Chifu, Adrian-Gabriel and Ionescu, Radu Tudor. Word sense disambiguation to improve precision for ambiguous queries. Central European Journal of Computer Science, 2(4):398–411, 2012. (cited on 3, 6)

Cong, Y., Yuan, J., and Liu, J. Sparse reconstruction cost for abnormal event detection. In Proceedings of CVPR, pp. 3449–3456, 2011. (cited on 15)

Conly, Christopher, Dillhoff, Alex, and Athitsos, Vassilis. Leveraging intra-class variations to improve large vocabulary gesture recognition. In Proceedings of ICPR, pp. 907–912, 2016. (cited on 14)

Cortes, Corinna and Vapnik, Vladimir. Support-Vector Networks. Machine Learning, 20(3):273–297, 1995. (cited on 2)

Cozma, Madalina, Butnaru, Andrei, and Ionescu, Radu Tudor. Automated essay scoring with string kernels and word embeddings. In Proceedings of ACL, pp. 503–509, 2018. (cited on 3, 17)

Csurka, Gabriella, Dance, Christopher R., Fan, Lixin, Willamowski, Jutta, and Bray, Cédric. Visual categorization with bags of keypoints. In Proceedings of Workshop on Statistical Learning in Computer Vision at ECCV, pp. 1–22, 2004. (cited on 4)

Del Giorno, Allison, Bagnell, J. Andrew, and Hebert, Martial. A Discriminative Framework for Anomaly Detection in Large Videos. In Proceedings of ECCV, pp. 334–349, October 2016. (cited on 15)

Dinu, Liviu P. and Ionescu, Radu Tudor. Clustering based on Median and Closest String via Rank Distance with Applications on DNA. Neural Computing and Applications, 24(1):77–84, 2013. (cited on 3)

Dinu, Liviu P. and Manea, Florin. An efficient approach for the rank aggregation problem. Theoretical Computer Science, 359(1–3):455–461, 2006. (cited on 10, 11, 16)

Dinu, Liviu P., Ionescu, Radu Tudor, and Popescu, Marius. Local Patch Dissimilarity for Images. In Proceedings of ICONIP, volume 7663, pp. 117–126. LNCS Springer-Verlag, 2012. (cited on
11)

Dinu, Liviu P., Ionescu, Radu Tudor, and Tomescu, Alexandru I. A rank-based sequence aligner with applications in phylogenetic analysis. PLoS ONE, 9(8):e104006, August 2014. doi: 10.1371/journal.pone.0104006. (cited on 3, 16)

Dong, Fei and Zhang, Yue. Automatic Features for Essay Scoring – An Empirical Study. In Proceedings of EMNLP, pp. 1072–1077, 2016. (cited on 17)

Dong, Fei, Zhang, Yue, and Yang, Jie. Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring. In Proceedings of CONLL, pp. 153–162, 2017. (cited on 17)

Duygulu, P., Barnard, Kobus, Freitas, J. F. G. de, and Forsyth, David A. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In Proceedings of ECCV, pp. 97–112, London, UK, 2002. Springer-Verlag. (cited on 9, 12)

Farhadi, Ali, Hejrati, Mohsen, Sadeghi, Mohammad Amin, Young, Peter, Rashtchian, Cyrus, Hockenmaier, Julia, and Forsyth, David. Every picture tells a story: generating sentences from images. In Proceedings of ECCV, pp. 15–29, Berlin, Heidelberg, 2010. Springer-Verlag. (cited on 9)

Fei-Fei, Li and Perona, Pietro. A Bayesian Hierarchical Model for Learning Natural Scene Categories. In Proceedings of CVPR, volume 2, pp. 524–531, Washington, DC, USA, 2005. IEEE Computer Society. (cited on 3)

Fellbaum, Christiane (ed.). WordNet: An Electronic Lexical Database. MIT Press, 1998. (cited on 12)

Forsyth, David A. and Ponce, Jean. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002. (cited on 3)

Galleguillos, Carolina and Belongie, Serge. Context Based Object Categorization: A Critical Survey. Computer Vision and Image Understanding, 114:712–722, 2010. (cited on 6)

Goodfellow, Ian, Courville, Aaron, and Bengio, Yoshua. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org. (cited on 3, 4)

Hasan, Mahmudul, Choi, Jonghyun, Neumann, Jan, Roy-Chowdhury, Amit K., and Davis, Larry S. Learning temporal regularity in video sequences. In Proceedings of CVPR, pp.
733–742, 2016. (cited on 15)

Hinami, Ryota, Mei, Tao, and Satoh, Shin'ichi. Joint Detection and Recounting of Abnormal Events by Learning Deep Generic Knowledge. In Proceedings of ICCV, pp. 3639–3647, 2017. (cited on 15)

Inza, Iñaki, Calvo, Borja, Armañanzas, Ruben, Bengoetxea, Endika, Larrañaga, Pedro, and Lozano, Jose A. Machine learning: an indispensable tool in bioinformatics. Methods in Molecular Biology (Clifton, N.J.), 593:25–48, 2010. (cited on 3)

Ionescu, Radu Tudor. Local Rank Distance. In Proceedings of SYNASC, pp. 221–228, Timisoara, Romania, 2013. IEEE Computer Society. (cited on 14, 16)

Ionescu, Radu Tudor. A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification. In Proceedings of ICONIP, volume 9490, pp. 390–400. Springer LNCS, 2015. (cited on 16)

Ionescu, Radu Tudor and Butnaru, Andrei. Learning to Identify Arabic and German Dialects using Multiple Kernels. In Proceedings of VarDial Workshop of EACL, pp. 200–209, 2017. (cited on 17)

Ionescu, Radu Tudor and Popescu, Marius. Speeding Up Local Patch Dissimilarity. In Proceedings of ICIAP, volume 8156, pp. 1–10, Heidelberg, 2013. LNCS Springer-Verlag. (cited on 11)

Ionescu, Radu Tudor and Popescu, Marius. Have a SNAK. Encoding Spatial Information with the Spatial Non-alignment Kernel. In Proceedings of ICIAP, volume 9279, pp. 97–108. Springer LNCS, 2015. (cited on 3, 14, 16)

Ionescu, Radu Tudor and Popescu, Marius. Knowledge Transfer between Computer Vision and Text Mining. Advances in Computer Vision and Pattern Recognition. Springer International Publishing, 2016a. (cited on 10)

Ionescu, Radu Tudor and Popescu, Marius. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. In Proceedings of VarDial Workshop of COLING, pp. 135–144, 2016b. (cited on 17)

Ionescu, Radu Tudor and Popescu, Marius. Can string kernels pass the test of time in native language identification?
In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 224–234, 2017. (cited on 17)

Ionescu, Radu Tudor, Popescu, Marius, and Grozea, Cristian. Local Learning to Improve Bag of Visual Words Model for Facial Expression Recognition. In Proceedings of ICML Workshop on Challenges in Representation Learning, 2013. (cited on 8)

Ionescu, Radu Tudor, Popescu, Andreea Lavinia, Popescu, Dan, and Popescu, Marius. Local Texton Dissimilarity with Applications on Biomass Classification. In Proceedings of VISAPP, Lisbon, Portugal, January 2014. (cited on 11)

Ionescu, Radu Tudor, Popescu, Andreea Lavinia, Popescu, Marius, and Popescu, Dan. BiomassID: A Biomass Type Identification System for Mobile Devices. Computers and Electronics in Agriculture, 113:244–253, 2015a. (cited on 3)

Ionescu, Radu Tudor, Popescu, Marius, and Cahill, Aoife. String kernels for native language identification: Insights from behind the curtains. Computational Linguistics, 42(3):491–525, 2016. (cited on 17)

Ionescu, Radu Tudor, Popescu, Marius, Conly, Christopher, and Athitsos, Vassilis. Local Frame Match Distance: A novel approach for exemplar gesture recognition. In Proceedings of EUSIPCO, pp. 788–792, 2017a. (cited on 4, 14)

Ionescu, Radu Tudor, Smeureanu, Sorina, Alexe, Bogdan, and Popescu, Marius. Unmasking the abnormal events in video. In Proceedings of ICCV, pp. 2895–2903, 2017b. (cited on 4, 15)

Ionescu, Radu Tudor, Smeureanu, Sorina, Popescu, Marius, and Alexe, Bogdan. Detecting abnormal events in video using Narrowed Motion Clusters. CoRR, abs/1801.05030, 2018. URL http://arxiv.org/abs/1801.05030. (cited on 4, 15)

Ionescu, Radu Tudor, Chifu, Adrian-Gabriel, and Mothe, Josiane. DeShaTo: Describing the Shape of Cumulative Topic Distributions to Rank Retrieval Systems Without Relevance Judgments. In Proceedings of SPIRE, volume 9309, pp. 75–82. Springer LNCS, 2015b. (cited on 3)

Johnson, Rie and Zhang, Tong. Effective Use of Word Order for Text Categorization with
Convolutional Neural Networks. In Proceedings of NAACL, pp. 103–112, 2015. (cited on 10)

Jurafsky, Daniel and Martin, James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2000. (cited on 9)

Koppel, Moshe, Schler, Jonathan, and Bonchek-Dokow, Elisheva. Measuring Differentiability: Unmasking Pseudonymous Authors. Journal of Machine Learning Research, 8:1261–1276, December 2007. (cited on 11, 15)

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS, pp. 1106–1114, 2012. (cited on 2, 9)

Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. A Sparse Texture Representation Using Local Affine Regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–1278, August 2005. (cited on 4)

Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of CVPR, volume 2, pp. 2169–2178, Washington, DC, USA, 2006. IEEE Computer Society. (cited on 4, 10, 14, 16)

Lebret, Remi, Legrand, Joel, and Collobert, Ronan. Is Deep Learning Really Necessary for Word Embeddings?
In Proceedings of Deep Learning Workshop at NIPS, December 2013. (cited on 10)

LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436–444, May 2015. (cited on 2, 4, 9, 12)

Leslie, Christina S., Eskin, Eleazar, and Noble, William Stafford. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Proceedings of Pacific Symposium on Biocomputing, pp. 566–575, 2002. (cited on 3)

Leung, Thomas and Malik, Jitendra. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. International Journal of Computer Vision, 43(1):29–44, June 2001. (cited on 4, 7)

Lodhi, Huma, Saunders, Craig, Shawe-Taylor, John, Cristianini, Nello, and Watkins, Christopher J. C. H. Text Classification using String Kernels. Journal of Machine Learning Research, 2:419–444, 2002. (cited on 3)

Lowe, David G. Object Recognition from Local Scale-Invariant Features. In Proceedings of ICCV, volume 2, pp. 1150–1157, Washington, DC, USA, 1999. IEEE Computer Society. (cited on 7)

Lowe, David G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004. (cited on 7)

Lu, C., Shi, J., and Jia, J. Abnormal Event Detection at 150 FPS in MATLAB. In Proceedings of ICCV, pp. 2720–2727, 2013. (cited on 15)

Maji, Subhransu, Berg, Alexander C., and Malik, Jitendra. Classification using intersection kernel support vector machines is efficient. In Proceedings of CVPR. IEEE Computer Society, 2008. (cited on 12)

Manning, Christopher D. and Schütze, Hinrich. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999. (cited on 6, 9)

Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. (cited on 2, 3, 4, 6)

Mehran, Ramin, Oyama, Alexis, and Shah, Mubarak. Abnormal crowd behavior detection using social force model. In Proceedings of CVPR, pp. 935–942, 2009. (cited on
15)

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Gregory S., and Dean, Jeffrey. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, pp. 3111–3119, 2013. (cited on 10, 12, 17, 18)

Miller, George A. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41, November 1995. (cited on 12)

Montavon, Gregoire, Orr, Genevieve B., and Müller, Klaus-Robert (eds.). Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science (LNCS). Springer, 2nd edition, 2012. (cited on 4)

Nilsson, Nils J. The quest for artificial intelligence: A history of ideas and achievements. 2010. (cited on 2)

Phandi, Peter, Chai, Kian Ming A., and Ng, Hwee Tou. Flexible Domain Adaptation for Automated Essay Scoring Using Correlated Linear Regression. In Proceedings of EMNLP, pp. 431–439, 2015. (cited on 17)

Philbin, James, Chum, Ondrej, Isard, Michael, Sivic, Josef, and Zisserman, Andrew. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of CVPR, pp. 1–8, 2007. (cited on 8)

Popescu, Marius and Grozea, Cristian. Kernel methods and string kernels for authorship analysis. In Forner, Pamela, Karlgren, Jussi, and Womser-Hacker, Christa (eds.), CLEF (Online Working Notes/Labs/Workshop), Rome, Italy, September 2012. (cited on 3)

Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., and Belongie, S. Objects in Context. In Proceedings of ICCV, 2007. (cited on 6)

Ravanbakhsh, Mahdyar, Nabi, Moin, Sangineto, Enver, Marcenaro, Lucio, Regazzoni, Carlo, and Sebe, Nicu. Abnormal Event Detection in Videos using Generative Adversarial Nets. In Proceedings of ICIP, 2017. (cited on 15)

Rosenblatt, Frank. The Perceptron – a perceiving and recognizing automaton. Technical report, Cornell Aeronautical Laboratory, 1957. Report 85–460–1. (cited on 2)

Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. Learning representations by back-propagating error. Nature, 323(9):533–536, 1986.
(cited on 2)
Sadeghi, M. A. and Farhadi, A. Recognition using visual phrases. In Proceedings of CVPR, pp. 1745–1752, Washington, DC, USA, 2011. IEEE Computer Society. (cited on 9)
Saligrama, Venkatesh and Chen, Zhu. Video anomaly detection based on local statistical aggregates. In Proceedings of CVPR, pp. 2112–2119, 2012. (cited on 15)
Sebastiani, Fabrizio. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, March 2002. (cited on 3)
Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. (cited on 3)
Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of ICLR, 2014. (cited on 3, 9)
Sivic, Josef, Russell, Bryan C., Efros, Alexei A., Zisserman, Andrew, and Freeman, William T. Discovering Objects and their Localization in Images. In Proceedings of ICCV, pp. 370–377. IEEE Computer Society, 2005. (cited on 4, 7)
Smeureanu, Sorina, Ionescu, Radu Tudor, Popescu, Marius, and Alexe, Bogdan. Deep Appearance Features for Abnormal Behavior Detection in Video. In Proceedings of ICIAP, volume 10485, pp. 779–789, 2017. (cited on 15)
Sun, Qianru, Liu, Hong, and Harada, Tatsuya. Online growing neural gas for anomaly detection in changing surveillance scenes. Pattern Recognition, 64(C):187–201, April 2017. (cited on 16)
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS, pp. 3104–3112, 2014. (cited on 10)
Tay, Yi, Phan, Minh C., Tuan, Luu Anh, and Hui, Siu Cheung. SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring. In Proceedings of AAAI, pp. 1–8, 2018. (cited on 17)
Vedaldi, Andrea and Zisserman, Andrew. Efficient additive kernels via explicit feature maps. In Proceedings of CVPR, pp. 3539–3546, San Francisco, CA, USA, 2010. IEEE Computer Society. (cited on 12)
Zhang, Jian, Marszalek, Marcin, Lazebnik, Svetlana, and Schmid, Cordelia. Local Features and
Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision, 73(2):213–238, June 2007. (cited on 3, 8)
Zhang, Ying, Lu, Huchuan, Zhang, Lihe, Ruan, Xiang, and Sakai, Shun. Video anomaly detection based on locality sensitive hashing filters. Pattern Recognition, 59:302–311, 2016. (cited on 16)

Chapter 2
Similarity-based Learning and Deep Learning

Abstract

The chapter describes all the machine learning methods that are employed in this thesis to obtain results for different applications of computer vision and string processing. The chapter gives an overview of the main concepts of learning based on similarity. Specific machine learning methods that are based on these concepts are then presented. First, nearest neighbor models are discussed. An overview of kernel methods is also given, since the state-of-the-art methods consistently used in the supervised learning tasks presented throughout this thesis are kernel methods. The chapter continues with a discussion about cluster analysis. Clustering techniques are used throughout this thesis in various contexts, from building vocabularies of visual words to outlier detection. The chapter ends with a discussion about deep learning, giving special attention to convolutional neural networks. Convolutional neural networks are employed to extract deep appearance features useful for abnormal event detection in video.

2.1 Introduction

In this chapter, we discuss two machine learning paradigms, namely deep learning [Goodfellow et al., 2016] and similarity-based learning [Chen et al., 2009], that are employed in various tasks discussed in this thesis. Learning based on similarity refers to the process of learning based on pairwise similarities between the training samples. The similarity-based learning process can be either supervised or unsupervised, and the pairwise relationship can be either a similarity, a dissimilarity, or a distance function.
Similarity functions may be asymmetric and may even fail to satisfy other mathematical properties required, for example, for metrics or inner products. When the learning process is supervised, the similarity-based method aims at estimating the class label of a test sample using both the pairwise similarities between the labeled training samples, and the similarities between the test sample and the set of training samples. When the learning process is unsupervised, the similarity-based method aims at finding some hidden structure in the unlabeled training samples, using the pairwise similarities between samples. An advantage of similarity-based learning is that it does not require direct access to the features, as long as the similarity function is well defined and can be computed for any pair of samples. Thus, the feature space is not required to be a Euclidean space.

On the other hand, deep learning provides a way to transform one feature representation into another, by better disentangling the factors of variation that explain the observed data. Deep learning algorithms are aimed at discovering multiple levels of representation, or a hierarchy of features. The goal of deep learning is to replace features handcrafted by engineers with features that are learned from data in an end-to-end fashion. The main reason behind the success of deep learning methods is that the end-to-end learning process provides a better feature representation when there is enough training data.

The rest of this chapter is organized as follows. Section 2.2 provides an overview of similarity-based learning. Nearest neighbor models are discussed in Section 2.3. An overview of kernel methods is given in Section 2.4. The chapter continues with Section 2.5, which gives an overview of clustering methods based on similarity. The chapter ends with Section 2.6, which provides a brief description of deep learning with a focus on convolutional neural networks.

2.2 Similarity-based Learning

Similarity-based
learning has a long history starting with k-nearest neighbors [Fix & Hodges, 1951], which is one of the oldest machine learning algorithms, and stretching to the state-of-the-art kernel methods [Shawe-Taylor & Cristianini, 2004]. Similarity-based learning methods have been widely used in several domains such as computer vision, natural language processing, computational biology, and information retrieval. Computer vision researchers proposed several methods based on computing similarity between images for object recognition and image retrieval. Such methods range from distance measures such as the Tangent distance [Simard et al., 1996], the Earth Mover's distance [Rubner et al., 2000], or the shape matching distance [Belongie et al., 2002], to kernel methods such as the pyramid match kernel [Grauman & Darrell, 2005; Lazebnik et al., 2006] or the PQ kernel [Ionescu & Popescu, 2013, 2015]. Most of the state-of-the-art techniques in computational biology, such as those that obtain phylogenetic trees or those that compare DNA sequences, are based on distance measures for strings. Popular choices for recent techniques are the Hamming distance [Chimani et al., 2011; Vezzi et al., 2012], edit distance [Shapira & Storer, 2003], Kendall's tau distance [Popov, 2007] or rank distance [Dinu & Ionescu, 2012a,b, 2013]. Other popular similarity-based tools from computational biology are the FASTA algorithm [Lipman & Pearson, 1985] and the BLAST algorithm [Altschul et al., 1990]. These tools compute the similarity between different amino acid sequences for protein classification. The cosine similarity between term frequency-inverse document frequency (TF-IDF) vectors is widely used in information retrieval and text mining for document classification [Manning et al., 2008]. More recently, the string kernel [Shawe-Taylor & Cristianini, 2004], which computes the similarity between strings by counting common character n-grams, has demonstrated impressive levels of performance for text
categorization (by topic) [Lodhi et al., 2002], authorship identification [Popescu & Dinu, 2007; Popescu & Grozea, 2012; Sanderson & Guenter, 2006], native language identification [Ionescu & Popescu, 2017; Ionescu et al., 2014, 2016b; Popescu & Ionescu, 2013], and Arabic dialect identification [Ionescu & Butnaru, 2017; Ionescu & Popescu, 2016].

The similarity-based learning paradigm consists of a wide variety of algorithms and approaches. Among these, only three are discussed in dedicated sections of this chapter, namely the nearest neighbor approach, the kernel methods and the cluster analysis techniques. These three approaches are used in different applications presented in this work. Since many of them are widely known and studied in literature, this chapter is rather aimed at giving an overview of the approaches used throughout this thesis. Other similarity-based learning methods, such as treating similarities as features, or generative classifiers, are briefly mentioned next.

By treating the similarities between a sample and training samples as features, similarity-based classification problems can be regarded as standard classification problems [Chen et al., 2009; Graepel et al., 1999, 1998; Liao & Noble, 2003; Pekalska & Duin, 2002]. In other words, each sample is represented by a feature vector obtained by computing the similarity with a set of training samples.

Generative classifiers provide a structured probabilistic model of the data. Training data is used for estimating the parameters of the generative model. Given the pairwise similarities of n samples, one approach to generative classification is to use the similarities as features. Then, the parameters of a standard generative model can be estimated from an n-dimensional feature space. Recently, another generative framework for similarity-based classification, termed similarity discriminant analysis, has been proposed in [Cazzanti et al., 2008]. It models the
class-conditional distributions of similarity statistics. Other approaches designed to reduce bias are a local variant proposed in [Cazzanti & Gupta, 2007] and a mixture model variant discussed in [Chen et al., 2009].

2.3 Nearest Neighbor Approach

Since the introduction of the k-nearest neighbors algorithm (k-NN) in [Fix & Hodges, 1951], the algorithm has been studied by many researchers and it is still an active topic in machine learning. The k-nearest neighbors algorithm is one of the simplest machine learning algorithms, which shows that simple models remain attractive to researchers. The nearest neighbor model is described in Algorithm 1.

Algorithm 1: Nearest Neighbor Algorithm
 1 Input:
 2   $S = \{(x_i, t_i) \mid x_i \in \mathbb{R}^m, t_i \in \mathbb{N}, i \in \{1, 2, ..., n\}\}$ - the set of n training samples and labels;
 3   $Z = \{z_i \mid z_i \in \mathbb{R}^m, i \in \{1, 2, ..., l\}\}$ - the set of l test samples;
 4   $k$ - the number of neighbors;
 5   $\Delta$ - a distance measure.
 6 Initialization:
 7   $Y \leftarrow \emptyset$;
 8 Computation:
 9 for $z_i \in Z$ do
10   $N \leftarrow$ the nearest k neighbors to $z_i$ from $S$ according to $\Delta$;
11   $y \leftarrow$ the majority label obtained through a voting scheme on $N$;
12   $Y \leftarrow Y \cup \{y\}$;
13 Output:
14   $Y = \{y_i \mid y_i \in \mathbb{N}, i \in \{1, 2, ..., l\}\}$ - the set of predicted labels for the test samples in $Z$.

The k-nearest neighbors classification rule employed in step 11 of Algorithm 1 works as follows: an object is assigned to the most common class of its k nearest neighbors, where k is a positive integer value. If k = 1, then the object is simply assigned to the class of its single nearest neighbor. When k > 1, the decision is based on a majority vote. It is convenient to let k be odd, which avoids voting ties in two-class problems. However, if voting ties do occur, the object can be assigned to the class of its 1-nearest neighbor, or one of the tied classes can be randomly chosen to be the class assigned to the object. The output of Algorithm 1 is a set of labels associated with the test samples.

The example about handwritten digit recognition presented in Figure 2.1 gives some insight into how the k-NN model works in
practice. In this example, digits are represented in a two-dimensional feature space. When a new sample x comes in, the algorithm selects the nearest 3 neighbors and assigns the majority class to x. In Figure 2.1, the majority label among the nearest 3 neighbors of x is 4. Thus, label 4 is assigned to x. This model can be referred to as a 3-NN model.

Figure 2.1: A 3-NN model for handwritten digit recognition. For visual interpretation, digits are represented in a two-dimensional feature space. The figure shows 30 digits sampled from the popular MNIST data set. When the new digit x needs to be recognized, the 3-NN model selects the nearest 3 neighbors and assigns label 4 based on a majority vote.

To better understand how the decision of the k-NN model is taken in general, it is worth considering a 1-NN model. For this model, the decision at every point is to assign the label of the closest data point. This process generates a Voronoi partition of the feature space induced by the training samples, as seen in Figure 2.2. Each training data point corresponds to a Voronoi cell. When a new data point comes in, it is assigned to the class associated with the Voronoi cell in which it falls.

Figure 2.2: A 1-NN model for handwritten digit recognition. The figure shows 30 digits sampled from the popular MNIST data set. The decision boundary of the 1-NN model generates a Voronoi partition of the digits.

The k-NN algorithm is a non-parametric method for classification. In other words, no parameters have to be learned. In fact, the k-NN model does not require training at all. The decision of the classifier is based only on the nearest k neighbors of an object with respect to a similarity or distance function. The Euclidean distance measure is a very common choice, but other similarity measures can also be used instead. Actually, the performance of the k-NN classifier depends on the strength and the discriminatory power of the distance measure used. It is worth mentioning that a good choice of the
distance metric can help to achieve invariance with respect to a certain family of transformations. For example, a distance metric that is invariant to scale, rotation, luminosity and contrast changes is a suitable choice for computer vision tasks. Researchers continue to study and develop new similarity or dissimilarity measures for a broad variety of applications in different domains. But, when it comes to testing the similarity measure in machine learning tasks, the method of choice is the k-NN model, because it directly reflects the strength of the similarity measure. Good examples of this fact are the Tangent distance [Simard et al., 1996] and the shape matching distance [Belongie et al., 2002], which are both used for handwritten digit recognition. For the same reason, the k-NN model is used to assess the performance of the new distance measure for gesture trajectories presented in Chapter 4 of this work.

It is interesting to mention that the k-NN model is one of the first classifiers for which an upper bound of its error rate has been demonstrated. More precisely, a theoretical result demonstrated in [Cover & Hart, 1967] states that the asymptotic error rate of the nearest neighbor rule is at most twice the error rate of the Bayes rule. Furthermore, if k is allowed to grow with n such that $k/n \to 0$, the nearest neighbor rule is universally consistent. More consistency results and other theoretical aspects of the k-NN model are discussed in [Devroye et al., 1996].

The k-NN model defers all the computations to the test phase. This represents a great disadvantage when the computational time is taken into consideration. Searching for the k nearest neighbors among n training samples may take O(nkd) time using a naive approach, where d represents the computational cost of the distance function. Different approaches based on multidimensional search trees that partition the space and guide the search have been proposed to reduce the time complexity [Dasarathy, 1991]. Other fast k-NN approaches are proposed in [Farago et al., 1993] and [Zhang & Srihari, 2004].

2.4 Kernel Methods

In the similarity-based learning paradigm, a popular approach is to treat the pairwise similarities as inner products in some Hilbert space or to treat pairwise dissimilarities as distances in some Euclidean space. This can be achieved in roughly two ways. One is to explicitly embed the samples in a Euclidean space, according to the pairwise similarities (or dissimilarities) using multidimensional scaling [Borg & Groenen, 2005]. Another is to modify the similarities into kernels and apply kernel methods. This section is focused on the latter approach and it covers the following topics: an overview of kernel methods, methods of combining kernels, such as kernel alignment and multiple kernel learning (MKL), and state-of-the-art kernel methods such as Support Vector Machines (SVM), Kernel Ridge Regression (KRR), Kernel Linear Discriminant Analysis (KDA), or Kernel Partial Least Squares Regression (KPLS). Special consideration is given to the topics that discuss kernel approaches used throughout the experiments presented in this thesis.

Kernel-based learning algorithms work by embedding the data into a Hilbert space and by searching for linear relations in that space, using a learning
algorithm. The embedding is performed implicitly, that is, by specifying the inner product between each pair of points rather than by giving their coordinates explicitly. The power of kernel methods lies in the implicit use of a Reproducing Kernel Hilbert Space induced by a positive semi-definite kernel function. Although the mathematical meaning of a kernel is the inner product in a Hilbert space, another interpretation of a kernel is the pairwise similarity between samples.

The kernel function gives kernel methods the power to naturally handle input data that is not in the form of numerical vectors, such as strings, images, or even video and audio files. The kernel function captures the intuitive notion of similarity between objects in a specific domain and can be any function defined on the respective domain that is symmetric and positive definite. For strings, many such kernel functions exist with various applications in computational biology and computational linguistics [Shawe-Taylor & Cristianini, 2004]. For images, a state-of-the-art approach is the pyramid match kernel [Grauman & Darrell, 2005; Lazebnik et al., 2006].

2.4.1 Mathematical Preliminaries

This section follows the theoretical presentation given in [Shawe-Taylor & Cristianini, 2004]. Therefore, most of the definitions, propositions and theorems are reproduced from [Shawe-Taylor & Cristianini, 2004] for the sake of completeness of this chapter. A definition of an inner product space is given next.

Definition 1 A vector space $X$ over the set of real numbers $\mathbb{R}$ is an inner product space, if there exists a real-valued symmetric bilinear (linear in each argument) map $\langle \cdot, \cdot \rangle$, that satisfies $\langle x, x \rangle \geq 0$, for all $x \in X$. The bilinear map is known as the inner product, dot product or scalar product.

An inner product space is sometimes referred to as a Hilbert space, although most researchers agree that additional properties of completeness and separability are required. Formally, a Hilbert space can be defined as
follows.

Definition 2 A Hilbert space $H$ is an inner product space with the additional properties of completeness and separability. A space $H$ is complete if every Cauchy sequence $\{h_n\}_{n \geq 1}$ of elements of $H$ converges to an element $h \in H$, where a Cauchy sequence is one that satisfies the property that
\[ \sup_{m > n} \|h_n - h_m\| \to 0, \text{ as } n \to \infty. \]
A space $H$ is separable if for any $\epsilon > 0$ there is a finite set of elements $\{h_1, ..., h_N\}$ of $H$ such that for all $h \in H$
\[ \min_i \|h_i - h\| < \epsilon. \]

[...]

\[ \Delta_{mean}(u, v) = \begin{cases} (m_x^u - m_x^v)^2 + (m_y^u - m_y^v)^2, & \text{if } h^u, h^v > 0 \\ \infty, & \text{otherwise} \end{cases} \]
\[ \Delta_{std}(u, v) = \begin{cases} (s_x^u - s_x^v)^2 + (s_y^u - s_y^v)^2, & \text{if } h^u, h^v > 0 \\ \infty, & \text{otherwise} \end{cases} \]
where $m_x$, $m_y$, $s_x$, and $s_y$ are components of the 5-tuples $u$ and $v$. If a visual word does not appear in at least one of the two compared images, its contribution to $k_{SNAK}$ is zero, since $\Delta_{mean}$ and $\Delta_{std}$ are infinite.

We can easily demonstrate that SNAK is a kernel function, by showing that it can be regarded as a sum of multiple kernels. Indeed, the proof that $k_{SNAK}$ is a kernel follows immediately from the following observation. For a given visual word $i$ and two 5-tuples $u$ and $v$, the expressions below represent two RBF kernels:
\[ \exp(-c_1 \cdot \Delta_{mean}(u(i), v(i))) \quad \text{and} \quad \exp(-c_2 \cdot \Delta_{std}(u(i), v(i))), \]
and their product is also a kernel. By summing up the products of the RBF kernels corresponding to all the 5-tuples inside the SNAK feature vectors $U$ and $V$, the $k_{SNAK}$ function is obtained. From the additive property of kernel functions given in Proposition 2, it results that $k_{SNAK}$ is also a kernel function.

An interesting remark is that $k_{SNAK}$ can be seen as a sum of separate kernel functions, each corresponding to a visual word that appears in both images. This is a fairly simple approach that can be easily generalized and combined with many other kernel functions. The following equation shows how to combine SNAK with another kernel $k^*$ that takes into account the frequency of visual words:
\[ k^*_{SNAK}(U, V) = \sum_{i=1}^{n} k^*(h^u(i), h^v(i)) \cdot \exp(-c_1 \cdot \Delta_{mean}(u(i), v(i))) \cdot \exp(-c_2 \cdot \Delta_{std}(u(i), v(i))). \quad (3.2) \]
Equation (3.2) can be used to combine SNAK with other kernels at the visual word level, individually. Certainly, using the above equation, SNAK can
be combined with kernels such as the linear kernel, the Hellinger's kernel, or the intersection kernel. The following equation is a particularization of Equation (3.2) for the intersection kernel:
\[ k^{\cap}_{SNAK}(U, V) = \sum_{i=1}^{n} \min\{h^u(i), h^v(i)\} \cdot \exp(-c_1 \cdot \Delta_{mean}(u(i), v(i))) \cdot \exp(-c_2 \cdot \Delta_{std}(u(i), v(i))). \]
Moreover, being a kernel function, SNAK can be combined with any other kernel using various approaches specific to kernel methods, such as multiple kernel learning.

3.4.1 Translation and Size Invariance

Intuitively, for a given visual word, the SNAK kernel measures the distance between the average positions of the respective visual word in two images. SNAK can be used to encode spatial information for various classification tasks, but some improvements based on task-specific information are possible. One such example is object class recognition. If the objects appear in roughly the same locations in the image, the SNAK approach would work fine. However, this restriction is often violated in practice. Any object can appear in any part of the image, and a visual word describing some part of the object can therefore appear in a different location in each image. Due to this fact, SNAK is not invariant to translations of the object. If the object's location in each image is known a priori, the average position of the visual word can be computed with respect to the object's location, by translating the origin of the coordinate system over the center of the object.

Figure 3.2: The spatial similarity of two images computed with the SNAK framework. First, the center of mass is computed according to the objectness map. The average position and the standard deviation of the spatial distribution of each visual word are computed next. The images are aligned according to their centers, and the SNAK kernel is computed by summing the distances between the average positions and the standard deviations of each visual word in the two images.

The exact location of the object is not known in
practice (as it requires human annotation), but it can be approximated using the objectness measure [Alexe et al., 2010, 2012]. This measure quantifies how likely it is for an image window to contain an object. By sampling a reasonable number of windows and by accumulating their probabilities, a pixelwise objectness map of the image can be produced. The objectness map provides a meaningful distribution of the (interesting) image regions that indicate locations of objects. Furthermore, the center of mass of the objectness map provides a good indication of where the center of the object might be. The SNAK framework employs the objectness measure to determine the object's center in order to use it as the origin of the coordinate system of the image. The range of the coordinate system is normalized by dividing the x-axis coordinates by the width of the image and the y-axis coordinates by the height of the image. For each image, the coordinate system has a range from -1 to 1 on each axis. Normalizing the coordinates ensures that the average position or the standard deviation of a visual word do not depend on the image size, and it is a necessary step to reduce the effect of size variation in a set of images. The SNAK framework is illustrated in Figure 3.2.

The ideal condition for SNAK is to have a single object per image. Although this is rarely the case in practice, SNAK still achieves impressive performance. Its performance would probably improve further if a class-specific object localization or detection framework were used instead of the objectness measure, but SNAK would also become less generally applicable, in the sense that it would need class-specific information to work.

3.5 Object Recognition Experiments

The object recognition experiments presented in this section compare the SNAK kernel with state-of-the-art kernels and spatial representations on two benchmark data sets. A brief description of the data sets is first provided. Details about the
implementation of the learning model and the evaluation procedure are given next. Finally, the results of the compared kernels on each data set are discussed.

3.5.1 Data Sets Description

The Pascal Visual Object Classes (VOC) challenge [Everingham et al., 2010] is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard data set of annotated images and standard evaluation procedures. In the experiments of this work, the Pascal VOC 2007 data set is used. The reason for this choice is that this is the latest data set for which testing labels are available for download, so the experiments can be done offline. There are roughly 10 thousand images in this data set, which contain 20 annotated object classes. As illustrated in Figure 3.3, some images may contain objects from several classes. Thus, the class labels are not mutually exclusive. For each class, the data set provides a training set, a validation set and a test set. The training and validation sets have roughly 2500 images each, while the test set has about 5000 images. This data set is available at http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html.

The second data set was collected from the Web by the authors of [Lazebnik et al., 2005] and consists of 100 images each of 6 different classes of birds: egrets, mandarin ducks, snowy owls, puffins, toucans, and wood ducks. The training set consists of 300 images and the test set consists of another 300 images. For each class, the data set contains 50 positive training images and 50 positive test images. The purpose of using this data set is to assess the behavior of the proposed kernels in the context of fine-grained object recognition. Figure 3.4 shows two images from each class of the Birds data set. The data set is available at http://www-cvr.ai.uiuc.edu/ponce_grp/data/.

3.5.2 Implementation and Evaluation Procedure

Details about the particularities of the learning framework are given next. In the
feature detection and representation step, a variant of dense SIFT descriptors extracted at multiple scales is used [Bosch et al., 2007]. The implementation of the BOVW model is mostly based on the VLFeat library [Vedaldi & Fulkerson, 2008].

The SNAK framework is compared with the spatial pyramid. Three kernels are proposed for evaluation, namely the L2-normalized linear kernel, the L1-normalized Hellinger's kernel, and the L1-normalized intersection kernel. An important remark is that the intersection kernel was chosen in particular because it yields very good results in combination with the spatial pyramid according to [Lazebnik et al., 2006], and it might work equally well in the SNAK framework.

Figure 3.3: A random sample of 12 images from the Pascal VOC data set. Some of the images contain objects of more than one class. For example, the image at the top left shows a dog sitting on a couch, and the image at the top right shows a person and a horse. Dog, couch, person and horse are among the 20 classes of this data set.

Figure 3.4: A random sample of 12 images from the Birds data set. There are two images per class. Images from the same class sit next to each other in this figure.

The three kernels proposed for evaluation are based on four different representations, three of which include spatial information. The goal of the experiments is to compare the standard bag-of-visual-words representation with a spatial pyramid based on two levels, a spatial pyramid based on three levels, and the SNAK feature vectors. The spatial pyramid based on two levels combines the full image with 2 × 2 bins, and the spatial pyramid based on three levels combines the full image with 2 × 2 and 4 × 4 bins. In the SNAK framework, the linear kernel, the Hellinger's kernel, and the intersection kernel are used in turn as $k^*$ in Equation (3.2). It is worth noting that SNAK can also be indirectly compared with the approach described in [Krapac et al., 2011], since the results reported in [Krapac et al.
, 2011] are very similar to the spatial pyramid based on three levels.

The norms of all the evaluated kernels are chosen according to [Vedaldi & Zisserman, 2010], who state that $\gamma$-homogeneous kernels should be $L^{\gamma}$-normalized. Furthermore, it is important to mention that all these kernels are used in the dual form, which implies using the kernel trick to directly build kernel matrices of pairwise similarities between samples. Since the kernel trick is employed in the evaluation, it is natural to obtain the spatial pyramid for each kernel by summing up the kernel matrices obtained for each level of the pyramid. The spatial pyramid representation is usually obtained as a concatenation of visual word histograms, but in the dual representation, concatenating feature vectors is equivalent to summing up kernel matrices.

In all the experiments, the training is always done using Support Vector Machines (SVM). On the Birds data set, the SVM classifier based on the one-versus-all scheme is used for the multi-class task.

The SNAK approach employs the objectness measure to align images. The objectness measure is trained on 50 images that are neither from the Pascal VOC data set nor from the Birds data set. The objectness map is obtained by sampling 1000 windows using the Non-Maximal Suppression (NMS) sampling procedure [Alexe et al., 2012]. The source code used to generate the objectness heat maps is available online at http://groups.inf.ed.ac.uk/calvin/objectness/.

The experiments are conducted using 500, 1000, and 3000 visual words, respectively. The evaluation procedure on the Pascal VOC data set follows the Pascal VOC benchmark. As such, the performance of the learning model is measured by using the classifier score to rank all the test images. Next, the retrieval performance can be measured by computing a precision-recall curve. In order to represent the retrieval performance by a single number (rather than a curve), the mean average precision (mAP) is often computed.
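The ranking-based evaluation described above can be illustrated with a minimal sketch. It assumes binary labels (1 for positive, 0 for negative) and real-valued classifier scores, and it computes the non-interpolated average precision, that is, the mean of the precision values measured at the rank of each positive sample. The function names `average_precision` and `mean_average_precision` are hypothetical and are not taken from the Pascal VOC development kit, whose official evaluation code may differ in details.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    Samples are ranked by decreasing classifier score; the precision is
    recorded each time a positive sample (label 1) is encountered, and
    the recorded values are averaged.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # By convention in this sketch, a class with no positives gets AP = 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """Mean of the per-class APs; per_class is a list of (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)


# Toy example with four test images and two classes (illustrative values).
scores = [0.9, 0.8, 0.7, 0.6]
class_a = (scores, [1, 0, 1, 0])  # positives at ranks 1 and 3
class_b = (scores, [0, 1, 0, 0])  # single positive at rank 2
print(average_precision(*class_a))
print(mean_average_precision([class_a, class_b]))
```

Because the class labels in Pascal VOC are not mutually exclusive, each class is scored as a separate binary ranking problem, which is why the sketch averages independent per-class values.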
The average precision as defined by TREC is used in the Pascal VOC experiments. This is the average of the precision observed each time a new positive sample is recalled. For the experiments performed on the Birds data set, the classification accuracy is used to evaluate the various kernels and spatial representations.

3.5.3 Parameter Tuning

The SNAK framework takes both the average position and the standard deviation of each visual word into account. In a set of preliminary experiments performed on the Birds data set, the two statistics were used independently to determine which one brings a more significant improvement. The empirical results demonstrated that they achieve roughly similar accuracy improvements, having an almost equal contribution to the proposed framework. Consequently, a decision was made to use the same value for the two constants $c_1$ and $c_2$ from Equation (3.1). Only five values in the range 1 to 100 were chosen for preliminary evaluation. The best results were obtained with $c_1 = c_2 = 10$, while choices like 5 or 50 were only 2–3% behind. Finally, a decision was made to use $c_1 = c_2 = 10$ in the experiments reported next, but it is very likely that better results can be obtained by fine-tuning the parameters $c_1$ and $c_2$ on each data set. An important remark is that $c_1$ and $c_2$ were tuned on the Birds data set, but the same choice was used on the Pascal VOC data set, without testing other values. Good results on Pascal VOC might indicate that $c_1$ and $c_2$ do not necessarily depend on the data set, but rather on the normalization procedure used for the spatial coordinate system.

It is interesting to note that the two coordinates are independently normalized as described in Section 3.4.1, resulting in small distortions along the axes. Two other methods of size-normalizing the coordinate space without introducing distortions were also evaluated. One is based on dividing both coordinates by the diagonal of the image, and the other by the mean of the width and height of the image. Perhaps
surprisingly, these have produced lower average precision scores on a subset of the Pascal VOC data set. For instance, size-normalizing by the mean of the width and height gives a mAP score that is roughly 0.5% lower than normalizing each axis independently by the width and height.

In the Pascal VOC experiment, the validation set is used to validate the regularization parameter C of the SVM algorithm. In the Birds experiment, the parameter C was adjusted by cross-validation on the training set.

3.5.4 Results on Pascal VOC Experiment

The SNAK framework is first evaluated on the Pascal VOC 2007 data set. For each of the 20 classes, the data set provides a training set, a validation set and a test set. After validating the regularization parameter of the SVM algorithm on the validation set, the classifier is trained one more time on both the training and the validation sets, which have roughly 5000 images together.

Table 3.1: Mean AP on the Pascal VOC 2007 data set for different representations that encode spatial information into the BOVW model. For each representation, results are reported using several kernels and vocabulary dimensions. The best AP for each vocabulary dimension and each kernel is highlighted in bold.

Representation               Vocabulary    Lin L2    Hel L1    Int L1
Histogram                    500 words     28.59%    39.06%    39.11%
Histogram                    1000 words    28.71%    42.28%    42.99%
Histogram                    3000 words    28.96%    45.23%    46.97%
Spatial pyramid (2 levels)   500 words     31.17%    44.21%    45.17%
Spatial pyramid (2 levels)   1000 words    31.38%    46.94%    48.27%
Spatial pyramid (2 levels)   3000 words    31.85%    49.21%    50.78%
Spatial pyramid (3 levels)   500 words     38.49%    45.20%    47.66%
Spatial pyramid (3 levels)   1000 words    39.59%    47.87%    49.85%
Spatial pyramid (3 levels)   3000 words    40.97%    50.37%    51.87%
SNAK                         500 words     42.56%    47.39%    49.75%
SNAK                         1000 words    44.69%    49.54%    51.99%
SNAK                         3000 words    45.95%    52.49%    54.05%

Table 3.1 presents the mean AP of various BOVW models obtained on the test set, by combining different spatial representations, vocabulary dimensions, and kernels. For each model, the reported mAP
represents the average score on all the 20 classes of the Pascal VOC data set.

The results presented in Table 3.1 clearly indicate that spatial information improves the performance of the BOVW model by a considerable margin. This observation holds for every kernel and every vocabulary dimension. Indeed, the spatial pyramid based on two levels shows a performance increase that ranges between 3% (for the linear kernel) and 6% (for the intersection kernel). As expected, the spatial pyramid based on three levels further improves the performance, especially for the linear kernel. When the 4 × 4 bins are added into the spatial pyramid, the mAP of the linear kernel grows by roughly 7-8%, while the mAP scores of the other two kernels increase by 1-2%. Among the three kernels based on spatial pyramids, the best mAP scores are obtained by the intersection kernel, which was previously reported to work best in combination with the spatial pyramid [Lazebnik et al., 2006].

The best results on the Pascal VOC data set are obtained by the SNAK framework. Indeed, the results are even better than those of the spatial pyramid based on three levels, which uses a representation that is more than four times larger than the SNAK representation. The mAP scores of the Hellinger's and the intersection kernels based on SNAK are roughly 2% better than the mAP scores of the same kernels combined with the spatial pyramid based on three levels. On the other hand, a 4-5% growth of the mAP score can be observed in the case of the linear kernel. Among the three kernels, the best results are obtained by the intersection kernel. When the intersection kernel is combined with SNAK, the best overall mAP score is obtained, namely 54.05%. This is 2.18% better than the intersection kernel combined with the spatial pyramid based on three levels.

Overall, the empirical results indicate that the SNAK approach is considerably better than the state-of-the-art spatial pyramid framework, in terms of recognition accuracy. Perhaps this comes
as a surprising result given that the images from the Pascal VOC data set usually contain multiple objects, and that SNAK implicitly assumes that there is a single relevant object in the scene, due to the use of the objectness measure. The SNAK framework also provides a more compact representation, which brings improvements in terms of space and time over a spatial pyramid based on three levels, for example.

3.5.5 Results on Birds Experiment

The SNAK framework is next evaluated on the Birds data set. Table 3.2 presents the classification accuracy of the BOVW model based on various representations that include spatial information. The results are reported on the test set, by combining different vocabulary dimensions and kernels.

The results of the SNAK framework on this data set are consistent with the results reported in the previous experiment, in that the SNAK framework yields better performance than the spatial pyramid representation. The spatial pyramid based on two levels improves the classification accuracy of the standard BOVW model by 3-4%. On top of this, the spatial pyramid based on three levels further improves the performance. Considerable improvements can be observed for the linear kernel and for the intersection kernel.

Table 3.2: Classification accuracy on the Birds data set for different representations that encode spatial information into the BOVW model. For each representation, results are reported using several kernels and vocabulary dimensions. The best accuracy for each vocabulary dimension and each kernel is highlighted in bold.

Representation               Vocabulary    Lin L2    Hel L1    Int L1
Histogram                    500 words     59.67%    72.00%    70.00%
Histogram                    1000 words    64.67%    78.33%    71.00%
Histogram                    3000 words    69.33%    80.33%    74.67%
Spatial pyramid (2 levels)   500 words     62.67%    75.67%    74.00%
Spatial pyramid (2 levels)   1000 words    66.67%    79.33%    74.33%
Spatial pyramid (2 levels)   3000 words    69.67%    81.00%    77.00%
Spatial pyramid (3 levels)   500 words     68.33%    76.67%    76.00%
Spatial pyramid (3 levels)   1000 words    70.33%    80.67%    78.00%
Spatial pyramid (3 levels)   3000 words    73.00%    82.67%    79.67%
SNAK                         500 words     69.33%    79.00%    76.33%
SNAK                         1000 words    71.67%    80.33%    78.67%
SNAK                         3000 words    72.33%    83.67%    81.33%

The spatial pyramid based on two levels shows little improvement over the histogram representation for the vocabulary of 3000 words, and more significant improvements for the vocabulary of 500 words. What is certain is that the spatial information helps to improve the classification accuracy on this data set, but the best approach seems to be the SNAK framework. With only two exceptions, the SNAK framework gives better results than the spatial pyramid based on three levels. Compared to the spatial pyramid based on two levels, which has the same number of features, the SNAK approach is roughly 3-5% better. An interesting observation is that the intersection kernel does not yield the best overall results as in the previous experiment, but it seems to gain a lot from the spatial information. For instance, the accuracy of the intersection kernel grows from 71.00% with histograms to 78.67% with SNAK, when the underlying vocabulary has 1000 words. The best accuracy (83.67%) is obtained by the Hellinger's kernel combined with SNAK, using a vocabulary of 3000 visual words. When it comes to fine-grained object class recognition, the overall empirical results on the Birds data set indicate that the SNAK framework is more accurate than the spatial pyramid approach.

3.6 Discussion

This chapter discussed an improvement of the BOVW model for object recognition. The contribution described in this chapter is an approach to improve the BOVW model by encoding spatial information in a more efficient way than spatial pyramids, by using a kernel function termed SNAK. More precisely, SNAK includes the spatial distribution of the visual words in the similarity of two images. The empirical results indicate that the SNAK framework can improve the object recognition accuracy over the spatial pyramid representation. Considering that
SNAK uses a more compact representation, the results become even more impressive. In conclusion, SNAK has all the ingredients to become a viable alternative to the spatial pyramid approach.

In this work, the objectness measure was used to add some level of translation invariance into the SNAK framework. In future work, the SNAK framework can be further improved by including ways of obtaining scale and rotation invariance. Ground-truth information about an object's scale can be obtained from manually annotated bounding boxes. A first step would be to use such bounding boxes to determine if it helps to compare objects at the same scale with the SNAK kernel. Another direction is to extend the SNAK framework to use the valuable information offered by objectness [Alexe et al., 2010], which is only barely used in the current framework.

References

Alexe, Bogdan, Deselaers, Thomas, and Ferrari, Vittorio. What is an object? In Proceedings of CVPR, pp. 73-80, San Francisco, CA, USA, June 2010. IEEE. (cited on 74, 84, 93)
Alexe, Bogdan, Deselaers, Thomas, and Ferrari, Vittorio. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189-2202, 2012. (cited on 74, 84, 88)
Bosch, Anna, Zisserman, Andrew, and Munoz, Xavier. Image Classification using Random Forests and Ferns. In Proceedings of ICCV, pp. 1-8. IEEE Computer Society, 2007. (cited on 77, 85)
Csurka, Gabriella, Dance, Christopher R., Fan, Lixin, Willamowski, Jutta, and Bray, Cédric. Visual categorization with bags of keypoints. In Proceedings of Workshop on Statistical Learning in Computer Vision at ECCV, pp. 1-22, 2004. (cited on 75, 76)
Dalal, Navneet and Triggs, Bill. Histograms of Oriented Gradients for Human Detection. In Proceedings of CVPR, volume 1, pp. 886-893, Washington, DC, USA, 2005. IEEE Computer Society. (cited on 77)
Deselaers, Thomas, Keyser, Daniel, and Ney, Hermann. Discriminative Training for Object Recognition using Image Patches. In Proceedings of CVPR, pp.
157-162, 2005. (cited on 75)
Dinu, Liviu P. and Manea, Florin. An efficient approach for the rank aggregation problem. Theoretical Computer Science, 359(1-3):455-461, 2006. (cited on 74, 80)
Dinu, Liviu P., Ionescu, Radu Tudor, and Popescu, Marius. Local Patch Dissimilarity for Images. In Proceedings of ICONIP, volume 7663, pp. 117-126. LNCS Springer-Verlag, 2012. (cited on 74, 80)
Everingham, Mark, van Gool, Luc, Williams, Christopher K., Winn, John, and Zisserman, Andrew. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303-338, June 2010. (cited on 85)
Fei-Fei, Li and Perona, Pietro. A Bayesian Hierarchical Model for Learning Natural Scene Categories. In Proceedings of CVPR, volume 2, pp. 524-531, Washington, DC, USA, 2005. IEEE Computer Society. (cited on 76)
Fei-Fei, Li, Fergus, Rob, and Perona, Pietro. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59-70, April 2007. (cited on 73)
Felzenszwalb, Pedro F., Girshick, Ross B., McAllester, David, and Ramanan, Deva. Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, September 2010. (cited on 73)
Ionescu, Radu Tudor. Local Rank Distance. In Proceedings of SYNASC, pp. 221-228, Timisoara, Romania, 2013. IEEE Computer Society. (cited on 74)
Ionescu, Radu Tudor and Popescu, Marius. Objectness to improve the bag of visual words model. In Proceedings of ICIP, pp. 3238-3242. IEEE, 2014. (cited on 77)
Ionescu, Radu Tudor and Popescu, Marius. Have a SNAK. Encoding Spatial Information with the Spatial Non-alignment Kernel. In Proceedings of ICIAP, volume 9279, pp. 97-108. Springer LNCS, 2015a. (cited on 73, 80)
Ionescu, Radu Tudor and Popescu, Marius. PQ kernel: a rank correlation kernel for visual word histograms. Pattern Recognition Letters, 55:51-57, 2015b. (cited on 79)
Ionescu, Radu Tudor, Popescu, Marius, and Grozea, Cristian. Local Learning to Improve Bag of Visual Words Model for Facial Expression Recognition. In Proceedings of ICML Workshop on Challenges in Representation Learning, 2013. (cited on 73)
Koniusz, Piotr and Mikolajczyk, Krystian. Spatial coordinate coding to reduce histogram representations, dominant angle and colour pyramid match. In Proceedings of ICIP, pp. 661-664. IEEE, 2011. (cited on 76)
Krapac, Josip, Verbeek, Jakob, and Jurie, Frederic. Modeling Spatial Layout with Fisher Vectors for Image Categorization. In Proceedings of ICCV, pp. 1487-1494. IEEE, November 2011. (cited on 76, 77, 87)
Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. A Maximum Entropy Framework for Part-Based Texture and Object Recognition. In Proceedings of ICCV, volume 1, pp. 832-838, Washington, DC, USA, 2005. IEEE Computer Society. (cited on 73, 85)
Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of CVPR, volume 2, pp. 2169-2178, Washington, DC, USA, 2006. IEEE Computer Society. (cited on 73, 76, 79, 86, 90)
Leung, Thomas and Malik, Jitendra. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. International Journal of Computer Vision, 43(1):29-44, June 2001. (cited on 75, 79)
Lopez-Monroy, A. Pastor, y Gomez, Manuel Montes, Escalante, Hugo Jair, Cruz-Roa, Angel, and Gonzalez, Fabio A. Improving the BOVW via Discriminative Visual N-Grams and MKL Strategies. Neurocomputing, 2015. ISSN 0925-2312. doi: http://dx.doi.org/10.1016/j.neucom.2015.10.053. (cited on 77)
Lowe, David G. Object Recognition from Local Scale-Invariant Features. In Proceedings of ICCV, volume 2, pp. 1150-1157, Washington, DC, USA, 1999. IEEE Computer Society. (cited on 77)
Lowe, David G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, November 2004. (cited
on 77)
Nister, David and Stewenius, Henrik. Scalable Recognition with a Vocabulary Tree. In Proceedings of CVPR, volume 2, pp. 2161-2168, Washington, DC, USA, 2006. IEEE Computer Society. (cited on 75)
Perronnin, Florent and Dance, Christopher R. Fisher kernels on visual vocabularies for image categorization. In Proceedings of CVPR. IEEE Computer Society, 2007. (cited on 75, 76)
Perronnin, Florent, Sanchez, Jorge, and Mensink, Thomas. Improving the fisher kernel for large-scale image classification. In Proceedings of ECCV, pp. 143-156, Berlin, Heidelberg, 2010. Springer-Verlag. (cited on 75, 76)
Philbin, James, Chum, Ondrej, Isard, Michael, Sivic, Josef, and Zisserman, Andrew. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of CVPR, pp. 1-8, 2007. (cited on 73, 75, 79)
Sanchez, Jorge, Perronnin, Florent, and de Campos, Teofilo. Modeling the spatial layout of images beyond spatial pyramids. Pattern Recognition Letters, 33(16):2216-2223, 2012. ISSN 0167-8655. (cited on 73, 76, 77)
Sivic, Josef, Russell, Bryan C., Efros, Alexei A., Zisserman, Andrew, and Freeman, William T. Discovering Objects and their Localization in Images. In Proceedings of ICCV, pp. 370-377. IEEE Computer Society, 2005. (cited on 73, 75)
Uijlings, J. R. R., Smeulders, A. W. M., and Scha, R. J. H. What is the Spatial Extent of an Object?
In Proceedings of CVPR, pp. 770-777, 2009. (cited on 73, 76, 77)
Vedaldi, Andrea and Fulkerson, B. VLFeat: An Open and Portable Library of Computer Vision Algorithms. http://www.vlfeat.org/, 2008. (cited on 85)
Vedaldi, Andrea and Zisserman, Andrew. Efficient additive kernels via explicit feature maps. In Proceedings of CVPR, pp. 3539-3546, San Francisco, CA, USA, 2010. IEEE Computer Society. (cited on 87)
Winn, J., Criminisi, A., and Minka, T. Object Categorization by Learned Universal Visual Dictionary. In Proceedings of ICCV, volume 2, pp. 1800-1807, Washington, DC, USA, 2005. IEEE Computer Society. (cited on 75)
Xie, Jin, Zhang, Lei, You, Jane, and Zhang, David. Texture classification via patch-based sparse texton learning. In Proceedings of ICIP, pp. 2737-2740, Hong Kong, China, 2010. (cited on 75)
Zhang, Jian, Marszalek, Marcin, Lazebnik, Svetlana, and Schmid, Cordelia. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision, 73(2):213-238, June 2007. (cited on 73, 76)

Chapter 4

Gesture Recognition using Local Frame Match Distance

Abstract

Gesture recognition using a training set of limited size for a large vocabulary of gestures is a challenging problem in computer vision. With few examples per gesture class, researchers often employ state-of-the-art exemplar-based methods such as Dynamic Time Warping. This chapter presents two contributions in the area of exemplar-based gesture recognition. As an alternative to Dynamic Time Warping, we first introduce the Local Frame Match Distance, a novel approach for matching gestures inspired by a distance measure for strings, namely Local Rank Distance. While Local Rank Distance efficiently approximates the non-alignment of character n-grams between two strings, we employ the Local Frame Match Distance to efficiently measure the non-alignment of hand locations between two video sequences. Second, we transform the Local Frame Match Distance into a kernel and
use it in combination with Kernel Discriminant Analysis for sign language recognition with exemplars. The empirical results indicate that our method can generally yield comparable performance to a state-of-the-art Dynamic Time Warping approach on the challenging task of American Sign Language recognition, while reducing the computational time by 30%.

4.1 Introduction

Gesture and sign language recognition represent a challenging research area in computer vision. Popular probabilistic methods such as Hidden Markov Models (HMM) [Baum & Petrie, 1966] and Conditional Random Fields (CRF) [Lafferty et al., 2001] require large training sets to learn good probability distributions. This requirement often limits the size of the set of gestures (vocabulary) that can be recognized by such systems. When a large vocabulary is desired, time constraints may force researchers to restrict the size of the training set to only a few examples per gesture class. As using few examples per class prohibits the use of many statistical and machine learning methods, researchers are often limited to exemplar-based recognition and similarity measures. In such cases, Dynamic Time Warping (DTW) [Kruskal & Liberman, 1999] is frequently used on hand location or other information to generate scores that serve as a measure of similarity for training examples [Corradini, 2001; Darrell & Pentland, 1993; Darrell et al., 1996; Stefan et al., 2009]. DTW has been improved with the use of a well-designed feature vector that includes more than hand positions to represent the state of a gesture at each point in time [Wang et al., 2012a].

In this chapter, we present an alternative solution to DTW, inspired by a distance measure for strings presented in Chapter 6, namely the Local Rank Distance (LRD) [Ionescu, 2013]. LRD has successfully been used for a broad range of tasks from phylogenetic analysis [Ionescu, 2013] and sequence alignment [Dinu et al., 2014] to native language identification [Ionescu, 2015;
Ionescu et al., 2016; Popescu & Ionescu, 2013] and Arabic dialect identification [Ionescu & Popescu, 2016]. LRD essentially measures the non-alignment (displacement) of character n-grams between two strings. Previous results indicate that LRD is more accurate [Dinu et al., 2014] and can be computed faster [Ionescu, 2015] than the edit distance [Levenshtein, 1966]. Since both DTW and edit distance are solved by dynamic programming, we can obtain a more efficient algorithm by adapting LRD for gesture recognition from video. Hence, we introduce the Local Frame Match Distance (LFMD) [Ionescu et al., 2017] algorithm to measure the distance (or similarity) between two gestures. In order to use LFMD for gesture recognition, we propose two approaches. The first approach is to employ a k-nearest neighbors algorithm. The second approach is to transform LFMD into a kernel function using the squared RBF kernel [Shawe-Taylor & Cristianini, 2004] and then employ Kernel Discriminant Analysis (KDA) to train our gesture classifier. To the best of our knowledge, KDA has never been used for exemplar-based gesture recognition. We compare our gesture recognition approach with a state-of-the-art approach based on DTW on the American Sign Language Lexicon Video Dataset (ASLLVD) [Athitsos et al., 2008]. The empirical results indicate that applying KDA can yield better performance, while applying LFMD can reduce the computational time by 30%.

The rest of this chapter is organized as follows. Related work on gesture and sign language recognition is presented in Section 4.2. Our learning framework is described in Section 4.3. The sign language recognition experiments are presented in Section 9.4. Finally, we draw our conclusions in Section 5.6.

4.2 Related Work

Most recent works have been in action and activity recognition, some from static images [Wang et al., 2012b], others from video [Tian et al., 2013]. These works tend to focus on classifying small vocabularies of general actions, rather than discriminating between specific actions such as language signs. Some action recognition works do test their methods on gesture data sets [Fernando et al., 2015; Song et al., 2013], but the vocabularies are limited, and the methods are generally not directly applicable to larger vocabulary gesture sets.

A second area of research focuses on generalized gesture recognition. The sets of gestures may be created specifically for this task, and can be chosen so as to minimize similarity between classes. Long Short Term Memory (LSTM) networks have proven successful for this task [Alsharif et al., 2015]. With the release of the ChaLearn Gesture Challenge data set [Guyon et al., 2012], there have been a number of works in one-shot learning, in which a single training example is used per gesture class [Jiang et al., 2015; Konecny & Hagara, 2014; Wan et al., 2013].

A third focus is on developing methods that work on well-established gesture sets, such as sign languages. One branch of work deals with continuous sign language recognition and fingerspelling [Kim et al., 2013]. Another branch of sign language recognition research focuses instead on classification of individually segmented signs. One popular intuitive method is to segment a sign into motion or other types of sub-units and then use an HMM to model the temporal changes in sub-units throughout each sign [Cooper et al., 2012; Wang et al., 2015]. Dynamic Time Warping has also been used for action and gesture recognition [Conly et al., 2016; Reyes et al., 2016; Stefan et al., 2009; Wang et al., 2012a] and it has shown its superiority over LSTM and HMM models [Conly et al., 2016]. Some of these works approach the idea of class variability modeling [Conly et al., 2016; Reyes et al., 2016].

4.3 Method

We present a gesture and sign language recognition system, given the hand trajectories of the gestures. We first compute a feature matrix for each hand gesture, as described in Section 4.3.1. Next, we compute the distance between two hand gestures
by employing our novel algorithm presented in Section 4.3.2. Finally, we either employ a k-nearest neighbors model or train a Kernel Discriminant Analysis classifier to recognize new hand gestures based on a kernel derived from the pairwise distances between gestures, as detailed in Section 4.3.3.

4.3.1 Feature Representation of Hand Gestures

To represent a hand gesture, we use the feature vector introduced in [Wang et al., 2012a]. The feature vector based on 2D hand position information is built for each video frame in order to describe what is occurring at every point in time. The hand positions are first expressed in a face-centric coordinate system. For one-handed signs, the position of the non-dominant hand is set to (0, 0), hence it will not contribute to the similarity score. The following features compose the vectors for each frame $t$ of gesture video $X$:

• $L_d(X, t)$ and $L_{nd}(X, t)$: 2D pixel coordinates of the dominant and non-dominant hands.
• $O_d(X, t) = L_d(X, t+1) - L_d(X, t)$ and $O_{nd}(X, t) = L_{nd}(X, t+1) - L_{nd}(X, t)$: motion direction from frame $t$ to frame $t+1$ for the dominant and non-dominant hands.
• $L_\delta(X, t) = L_d(X, t) - L_{nd}(X, t)$: position of the dominant hand relative to the non-dominant hand.
• $O_\delta(X, t) = L_\delta(X, t+1) - L_\delta(X, t)$: direction of change for $L_\delta$ from frame $t$ to frame $t+1$.

In total, there are 12 features in the vector representing each video frame. The feature vectors are combined into a single matrix to describe the sign. In the experiments, we use manual annotations of the hand positions. The hand gesture is size-normalized so that the diagonal of the face bounding box is 1. Finally, the frame length is normalized to 24 frames using bicubic interpolation on the feature matrix, as in [Wang et al., 2012a]. Hence, the size of the feature matrix for a hand gesture becomes 12 × 24.

4.3.2 Local Frame Match Distance

In this section, we describe a novel algorithm for computing the similarity (or the distance) between the hand trajectories of two gestures. Our algorithm is inspired
by LRD [Ionescu, 2013], which has successfully been applied to phylogenetic analysis [Ionescu, 2013], sequence alignment [Dinu et al., 2014], native language identification [Ionescu, 2015; Ionescu et al., 2016; Popescu & Ionescu, 2013] and Arabic dialect identification [Ionescu & Butnaru, 2017; Ionescu & Popescu, 2016]. We next present how we adapt LRD and obtain a novel algorithm, termed Local Frame Match Distance, for the task of gesture recognition. Given the hand locations in each video frame, we match each hand location from the first video sequence to the nearest hand location (in terms of the features derived from pixel coordinates) in the second video sequence. Then, we compute the sum of the absolute differences between the indexes of matched frames. As LRD operates on character n-grams in order to yield better performance, we can also extend the LFMD algorithm to match sets of consecutive hand locations to achieve the same goal.

Local Frame Match Distance is formally presented in Algorithm 5. We use the following notations for describing the algorithm. An array (or an ordered set of elements) is denoted by $V = (v_1, v_2, \ldots, v_n)$ and the length of $V$ is denoted by $|V| = n$. Arrays are considered to be indexed starting from position 1, thus $V[i] = v_i, \forall i \in \{1, 2, \ldots, n\}$. We extend this notation to matrices as well, therefore we consider that $M[i, j]$ represents the element on row $i$ and column $j$ of the matrix $M$.

The goal of Algorithm 5 is to compute a distance between two hand trajectories represented as feature matrices $X$ and $Y$. As LRD generally obtains better results when matching character n-grams instead of single characters, we also want to match a set of consecutive frames in $X$ with another set of consecutive frames in $Y$ by minimizing a cost function. For the sake of simplicity, we will refer to a set of consecutive frames $X_{i:i+p-1} = \{X_i, X_{i+1}, \ldots, X_{i+p-1}\}$ as a p-frame, where $p$ denotes the number of frames considered in the set denoted by $X_{i:i+p-1}$. For individual frames
$X_i$ and $Y_j$, we employ the same cost function as in [Wang et al., 2012a], but we assign equal weights to all the features, therefore eliminating the weights and the need to tune them on a validation set:

$$
\begin{aligned}
\mathrm{cost}(X_i, Y_j) = {} & \|L_d(X, i) - L_d(Y, j)\|_2 + \|L_{nd}(X, i) - L_{nd}(Y, j)\|_2 \\
& + \|O_d(X, i) - O_d(Y, j)\|_2 + \|O_{nd}(X, i) - O_{nd}(Y, j)\|_2 \\
& + \|L_\delta(X, i) - L_\delta(Y, j)\|_2 + \|O_\delta(X, i) - O_\delta(Y, j)\|_2,
\end{aligned}
\qquad (4.1)
$$

where $\|\cdot\|_2$ represents the $L_2$-norm. For p-frames $X_{i:i+p-1}$ and $Y_{j:j+p-1}$, we naively consider the cost given by diagonally aligning the individual frames:

$$
\mathrm{cost}(X_{i:i+p-1}, Y_{j:j+p-1}) = \mathrm{cost}(X_i, Y_j) + \mathrm{cost}(X_{i+1}, Y_{j+1}) + \ldots + \mathrm{cost}(X_{i+p-1}, Y_{j+p-1}). \qquad (4.2)
$$

In a similar way to DTW, we build a cost matrix $C$ for each pair of p-frames in $X$ and $Y$. Figure 4.1 illustrates the cost matrix computed using DTW. Nonetheless, the process for computing the cost matrix is slightly different for LFMD. We first pre-compute some of the components of the matrix $C$ in steps 18-24 of Algorithm 5. The rest of the components are computed in steps 25-30. It is important to note that we compute only those components for which the ab-

Algorithm 5: LFMD Algorithm
 1 Input:
 2   X, Y - the input feature matrices for two hand gestures;
 3   n - the number of frames of the two hand gestures;
 4   p - the number of consecutive frames to be matched;
 5   m - the maximum spatial offset (m ≤ n);
 6 Notations:
 7   C - the (n - p + 1) × (n - p + 1) cost matrix of all possible pairs of p-frames from X and Y;
 8   M_X - the vector with minimal costs for each p-frame in X;
 9   J_X - the indexes of the p-frames in Y corresponding to the minimal costs in M_X;
10   M_Y - the vector with minimal costs for each p-frame in Y;
11   J_Y - the indexes of the p-frames in X corresponding to the minimal costs in M_Y;
12 Initialization:
13   M_X ← (∞, ∞, ..., ∞), such that |M_X| = n - p + 1;
14   M_Y ← (∞, ∞, ..., ∞), such that |M_Y| = n - p + 1;
15   J_X ← (0, 0, ..., 0), such that |J_X| = n - p + 1;
16   J_Y ← (0, 0, ..., 0), such that |J_Y| = n - p + 1;
17 Computation:
18   if p ≥ 2 then
19     for i ∈ {1, ..., min{m, n - p}} do
20       C[i + p - 2, p - 1] ← cost(X_{i+p-2}, Y_{p-1});
21       C[p - 1, i + p - 2] ← cost(X_{p-1}, Y_{i+p-2});
22       for j ∈ {p - 2, ..., 1} do
23         C[i + j - 1, j] ← cost(X_{i+j-1}, Y_j) + C[i + j, j + 1];
24         C[j, i + j - 1] ← cost(X_j, Y_{i+j-1}) + C[j + 1, i + j];
25   for i ∈ {1, ..., n - p + 1} do
26     for j ∈ {1, ..., n - p + 1} do
27       if |i - j| ≤ m then
28         C[i + p - 1, j + p - 1] ← cost(X_{i+p-1}, Y_{j+p-1});
29         for k ∈ {p - 2, ..., 0} do
30           C[i + k, j + k] ← C[i + k, j + k] + C[i + p - 1, j + p - 1];
31       if C[i, j]

1 do
21   mid ← ⌊(beg + end)/2⌉;
22   if i = h(x[i : i + p])[mid] then
23     beg ← mid;
24     end ← mid;
25   else
26     if i

, ABB => , BBA => , ABA => }

The next step is to take the 3-grams from $s_1$ and look them up in $h$:

• The first 3-gram from $s_1$, namely ABA, is found at position 5 in $s_2$ according to $h$, so the offset $|1 - 5|$ is added to the distance.
• The second 3-gram from $s_1$, namely BAB, is found at positions in $s_2$ according to $h$. In this case, a binary search in the array is employed to find the closest position to 2. The closest position is 1, so the offset $|2 - 1|$ is added to the distance.

In conclusion, $\Delta_{left}(s_1, s_2)$ is the sum of $|1 - 5|$ and $|2 - 1|$, namely 5.

6.4 Local Rank Distance Sequence Aligners

This section introduces two sequence aligners that work under Local Rank Distance. The first one is based on a basic algorithm that aligns a read of length $l$ against a reference DNA sequence of length $n$. For efficiency reasons, it actually computes only $\Delta_{left}$ from Definition 6 between the read and a certain substring from the reference genome. It is perfectly reasonable to use only one of the two partial sums, $\Delta_{left}$ or $\Delta_{right}$, since the symmetry property of LRD is no longer needed in the context of sequence alignment.

Algorithm 7: LRD Sequence Aligner
 1 Input:
 2   r - a short DNA string of length l;
 3   s - a reference DNA sequence of length n;
 4   k - the size of the k-mers to be compared;
 5   m - the maximum offset;
 6   th - the maximum distance threshold accepted for the aligned read;
 7   d - the threshold that can be adjusted to skip the alignment at some positions.
 8 Initialization:
 9   Δ_min ← th;
10   bestPos ← 0;
11 Computation:
12   for i ∈ {1, ..., l - k + 1} do
13     add i in the array
stored at h(r[i : i + k]);
14   for i ∈ {1, ..., n - k + 1} do
15     if |h(s[i : i + k])| > 0 then
16       f[i] ← true;
17     else
18       f[i] ← false;
19   count ← 0;
20   for i ∈ {1, ..., l - k + 1} do
21     if f[i] = true then
22       count ← count + 1;
23   c ← count;
24   for i ∈ {2, ..., n - l + 1} do
25     if f[i - 1] = true then
26       count ← count - 1;
27     if f[i + l - k + 1] = true then
28       count ← count + 1;
29     c[i] ← count;
30   for i ∈ {1, 2, ..., n - l + 1} do
31     if c[i] max{c} d and (|r| k c[i]) m

       Δ_min then
35         abort and proceed to the next value of i in the loop from step 30;
36       else
37         if f[i + j - 1] = true then
38           do a binary search in the array stored at h(s[i + j - 1 : i + j - 1 + k]) to obtain the position p that minimizes |j - p|;
39           Δ ← Δ + min{|j - p|, m};
40         else
41           Δ ← Δ + m;
42     if

$$
tf(t, d) =
\begin{cases}
1 + \log f_{t,d}, & \text{if } f_{t,d} > 0 \\
0, & \text{if } f_{t,d} = 0
\end{cases}
\qquad (7.1)
$$

To make things completely clear, Example 7 shows how to compute the term frequency in a particular case.

Example 7. Given a document d = "He lives in a big house with a big garage in a big city." and a term t = "big", the number of occurrences of $t$ in $d$ is $f_{t,d} = 3$. Hence, the log-normalized term frequency is:

$$
tf(t, d) = 1 + \log f_{t,d} = 1 + \log 3 \approx 1 + 0.4771 = 1.4771.
$$

1 Stop words are the most common words in a language, usually function words, such as what, is, this.
2 Stemming is the process that reduces a word to its root form.

7.3.1 Spatial Pyramid for Text

The work of [Lazebnik et al., 2006] presents a method for recognizing scene categories in images based on aggregating statistics of local features (visual words) over fixed sub-regions (bins). Their technique works by partitioning the image into bins and computing histograms of visual words found inside each bin. This process is repeated at multiple levels, and the resulting histograms are concatenated into a single representation known as the spatial pyramid. For example, if the spatial pyramid is based on three levels, the convention followed by [Lazebnik et al., 2006] is to divide the image into 1 × 1, 2 × 2, and 4 × 4 bins. In other words, the number of bins at level $l$ is $2^{l-1} \times 2^{l-1}$. At the first level
there is a single bin that coincides with the entire image to be analyzed. The spatial pyramid representation is a simple and efficient extension of the bag-of-visual-words representation that contains information about the visual words that appear in a predefined region of the input image.

In a similar fashion, a spatial pyramid for text can be developed. More precisely, at each level $l$ of the spatial pyramid, the text is divided into $l$ parts of equal length, and a bag-of-words representation is computed for each part. The bag-of-words representation contains log-normalized term frequencies of the terms that appear in a given part of the input text document. Thus, it can also be described as a word histogram. It is worth noting that we do not keep the convention to use $2^{l-1}$ bins at level $l$ in the spatial pyramid for text. This is primarily motivated by the fact that a text can naturally be structured into a number of parts that is not necessarily a power of two. For example, an essay can be divided into an introduction, a body, and a conclusion. The narrative structure of a novel can also be divided into three sections known as the setup, the conflict, and the resolution. Although the spatial pyramid is a naive approach that does not divide the text into meaningful parts, it has a greater chance of approximating these meaningful parts if the text is divided into $l$ parts at level $l$.

Let $L$ be the number of levels chosen for the spatial pyramid. The total number of word histograms $T$ is:

\[
T = 1 + 2 + \ldots + L = \frac{L(L + 1)}{2}.
\]

The word histograms are concatenated into a single feature vector that represents the entire text document. The final feature vector is termed spatial pyramid for text. Given a vocabulary of terms $V$, the number of features in the spatial pyramid is $|V| \cdot T$, where $|V|$ is the size of the vocabulary. For example, using a spatial pyramid based on three levels will generate a representation that is six times larger than the standard BOW
representation ($T = 1 + 2 + 3 = 6$).

After the spatial pyramid representation for text is obtained, the final implementation issue that needs to be settled is the normalization. The same approach as [Lazebnik et al., 2006] is adopted here. More precisely, all histograms are normalized by the total weight of all words in the text, which gives maximum computational efficiency. This kind of normalization is enough to deal with the effects of variable text lengths. Last but not least, it is worth mentioning that the spatial pyramid is a fairly simple approach, being very easy to implement and use in many practical applications.

7.3.2 Spatial Non-Alignment Kernel for Text

The Spatial Non-Alignment Kernel (SNAK) is a framework presented in Chapter 3 that includes spatial information into the bag-of-visual-words model. Along with Local Patch Dissimilarity [Dinu et al., 2012] and Local Rank Distance [Ionescu, 2013], it stems from the idea of measuring the spatial non-alignment between two objects. In computer vision, the SNAK framework can roughly determine the spatial non-alignment between two images by measuring how the spatial distribution of each visual word varies in the two images. In text mining, the SNAK framework can be adapted in a straightforward manner to measure the spatial non-alignment between two text documents by considering words instead of visual words.

As in computer vision, additional information for each word needs to be stored in the feature representation of a text document. Indeed, the average position and the standard deviation of all the occurrences of a word in the text document need to be computed. While in the case of image data these statistics are computed for each of the two image coordinates, this is no longer necessary for text data. The SNAK feature vector of a text document includes the average position and the standard deviation of a term together with the log-normalized frequency of the respective term, resulting in a feature space that is three
times greater than the original feature space corresponding to the standard bag-of-words. The size of the feature space is identical to a spatial pyramid for text based on two levels, but it is two times smaller than a spatial pyramid based on three levels.

Let $U$ represent the SNAK feature vector of a text document. For each term at an index $i$ in a vocabulary, $U$ will contain triplets as defined below:

\[
u(i) = (tf_u(i), m_u(i), s_u(i)),
\]

where the first component of $u(i)$ represents the log-normalized term frequency as defined in Equation (7.1), $m_u(i)$ represents the mean (or average) position of the $i$-th term, and $s_u(i)$ represents the standard deviation of the positions of the $i$-th term. It is important to note that the last two components of $u(i)$ are normalized with respect to the length of the text document, to reduce the effects of variable text lengths in a collection of documents. If the term $i$ does not appear in the text document ($tf_u(i) = 0$), the last two components are undefined. In fact, $m_u(i)$ and $s_u(i)$ are not used at all if $tf_u(i) = 0$. Example 8 shows how to compute a triplet for a certain term that appears in a text document.

Example 8. Given a document d = "He lives in a big house with a big garage in a big city." and a term t = "big", it can be easily observed that t appears precisely three times in d, at positions 5, 9 and 13, respectively. The length of d is 14. Let $U$ represent the SNAK feature vector of d, and let $i$ denote the index of t in the vocabulary of terms. The components in the triplet $u(i)$ are computed as follows:

\[
tf_u(i) = 1 + \log 3 \approx 1 + 0.4771 \approx 1.4771,
\]
\[
m_u(i) = \frac{5 + 9 + 13}{3} \cdot \frac{1}{14} = \frac{9}{14} \approx 0.6429,
\]
\[
s_u(i) = \sqrt{\frac{1}{3 - 1}\left((5 - 9)^2 + (9 - 9)^2 + (13 - 9)^2\right)} \cdot \frac{1}{14} = \sqrt{\frac{32}{2}} \cdot \frac{1}{14} \approx 0.2857.
\]

Finally, the triplet corresponding to the term t is:

\[
u(i) \approx (1.4771, 0.6429, 0.2857).
\]

As in the case of visual words, the SNAK kernel between two feature vectors $U$ and $V$ can be defined as follows:

\[
k_{SNAK}(U, V) = \sum_{i=1}^{n} \exp\left(-c_1 \cdot \Delta_{mean}(u(i), v(i))\right) \cdot \exp\left(-c_2 \cdot \Delta_{std}(u(i), v(i))\right), \qquad (7.2)
\]

where $n$ is the number of terms in the vocabulary, $c_1$ and $c_2$ are two
parameters with positive values, $u(i)$ is the triplet in $U$ corresponding to the $i$-th term in the vocabulary, $v(i)$ is the triplet in $V$ corresponding to the $i$-th term in the vocabulary, and $\Delta_{mean}$ and $\Delta_{std}$ are defined as follows:

\[
\Delta_{mean}(u, v) =
\begin{cases}
(m_u - m_v)^2, & \text{if } tf_u, tf_v > 0 \\
\infty, & \text{otherwise}
\end{cases}
\]
\[
\Delta_{std}(u, v) =
\begin{cases}
(s_u - s_v)^2, & \text{if } tf_u, tf_v > 0 \\
\infty, & \text{otherwise}
\end{cases}
\]

where $m$ and $s$ are components of the triplets $u$ and $v$. If a term does not appear in at least one of the two compared text documents, its contribution to $k_{SNAK}$ is zero, as $\Delta_{mean}$ and $\Delta_{std}$ are infinite.

Since the definition of SNAK for text data is essentially the same as in computer vision, it remains a kernel function. As in computer vision, the SNAK framework is a fairly simple approach that can be easily generalized and combined with many other kernel functions. The following equation shows how to combine SNAK with another kernel $k$ that takes into account the log-normalized term frequency:

\[
k_{SNAK}(U, V) = \sum_{i=1}^{n} k(tf_u(i), tf_v(i)) \cdot \exp\left(-c_1 \cdot \Delta_{mean}(u(i), v(i))\right) \cdot \exp\left(-c_2 \cdot \Delta_{std}(u(i), v(i))\right). \qquad (7.3)
\]

In the experiments, the linear kernel is used as $k$ in Equation (7.3).

7.4 Experiments

7.4.1 Data Sets Description

The Reuters-21578 corpus [Lewis, 1997] is the first data set used in the evaluation. It is one of the most widely used test collections for text categorization research and contains 21,578 articles collected from the Reuters newswire. Following the procedure of [Joachims, 1998; Yang & Liu, 1999], the categories that have at least one document in the training set and one in the test set are selected. This leads to a total of 90 categories. Two evaluation modes are then used. In the first mode, unlabeled documents are eliminated. After removing the unlabeled documents, there are 10,787 documents left that belong to 90 categories. Each document belongs to one or more categories and the average number of categories per document is 1.235. The collection is split into 7,768 documents in the training set and 3,019 documents in the test set. In the second
evaluation mode, the unlabeled documents are kept in the collection. Hence, the second evaluation mode is a little more difficult. It leads to a corpus of 9,598 training documents and 3,299 test documents. Two slightly different evaluation modes are used because previous works use one of the two modes, so their results are not directly comparable. For instance, the first evaluation mode is used in [Xue & Zhou, 2009], while the second evaluation mode is used in [Joachims, 1998].

The second corpus used in the evaluation is 20 Newsgroups. The 20 Newsgroups corpus [Lang, 1995] contains 19,997 articles taken from the Usenet newsgroup collections. Following the approach of [Schapire & Singer, 2000; Xue & Zhou, 2009], the duplicate documents are removed. There are 19,465 documents left that belong to 20 categories. The evaluation is based on 4-fold cross validation, following the same procedure as [Bekkerman et al., 2003; Xue & Zhou, 2009].

7.4.2 Implementation Choices

The bag-of-words representation is obtained by eliminating the stop words and by stemming the rest of the words. The standard bag-of-words is used as a baseline model in the text categorization experiments. In computer vision, the spatial pyramid is usually based on two or three levels in practical situations. It has been observed [Lazebnik et al., 2006] that adding more pyramid levels does not necessarily increase performance, while the additional space they require becomes hard to justify. Thus, the spatial pyramids evaluated in the following experiments are based on two and three levels, respectively. The SNAK framework takes both the average position and the standard deviation of each term into account. In object recognition from images, empirical results demonstrated that they have an almost equal contribution to the proposed framework. Hence, the two constants $c_1$ and $c_2$ from Equation (7.3) are set to the same value in the text categorization experiments.
Since the average position and the standard deviation are normalized with respect to the document length, a good choice for $c_1$ and $c_2$ is 0.5. Although no tuning is performed in the case of SNAK, it is worth mentioning that tuning the parameters $c_1$ and $c_2$ is likely to improve performance. As the other two evaluated methods (standard BOW and spatial pyramid) do not require tuning, it is perhaps better to refrain from tuning $c_1$ and $c_2$ for a fair comparative study. In all the experiments, Kernel Ridge Regression (KRR) is the method of choice for the learning stage.

7.4.3 Evaluation Procedure

To evaluate and compare the text categorization approaches, the precision and the recall are first computed based on the confusion matrix presented in Table 7.1. The precision is given by the number of true positive documents ($TP$) divided by the number of documents predicted as positive by the classifier ($TP + FP$), while the recall is given by the number of true positive documents ($TP$) divided by the total number of documents marked as positive by a trusted expert judge ($TP + FN$). To capture precision and recall into a single representative number, the $F_1$ measure can be employed. The $F_1$ measure is the harmonic mean of the precision and the recall, given by:

\[
F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}.
\]

Table 7.1: Confusion matrix (also known as contingency table) of a binary classifier with labels +1 or −1.

                                 Expert judgments
                        Labels   +1                    −1
Classifier predictions  +1       true positive (TP)    false positive (FP)
                        −1       false negative (FN)   true negative (TN)

For each category, a binary classifier is trained to predict the positive and negative labels for the test documents. However, the performance of the classifier needs to be evaluated at the global level (over all categories). Two approaches are used in the literature to aggregate the $F_1$ measure over multiple categories. One is based on computing a confusion matrix for each category, which can be used to subsequently calculate the $F_1$ measure for each category. Finally, the global $F_1$ measure is obtained
by averaging all the $F_1$ measures. This first measure is known as macro-averaged $F_1$ (macro$F_1$). The other approach is based on computing a global confusion matrix for all the categories by summing the documents that fall in each of the four conditioned sets, namely true positives, true negatives, false positives, and false negatives. The global $F_1$ measure is immediately computed with the values provided by the global confusion matrix. This second measure is known as micro-averaged $F_1$ (micro$F_1$). As noted by [Xue & Zhou, 2009], the classifier's performance on rare categories has more impact on the macro-averaged $F_1$ measure, while the performance on common categories has more impact on the micro-averaged $F_1$ measure. Thus, it makes sense to report both these measures in the following experiments.

Table 7.2: Empirical results on the Reuters-21578 corpus obtained by the standard bag-of-words versus two methods that encode spatial information, namely spatial pyramids and SNAK. The macro-averaged and the micro-averaged $F_1$ measures are reported for two evaluation modes, one that includes unlabeled documents and one that excludes unlabeled documents. The learning is always done by KRR. The best score in each column is obtained by SNAK. The marker * indicates that the performance is significantly better than the baseline according to a Student's t-test performed at a significance level of 0.01.

                             Unlabeled documents     Unlabeled documents
                             included                excluded
Representation               microF1    macroF1      microF1    macroF1
Histogram                    0.865      0.511        0.875      0.523
Spatial Pyramid (2 levels)   0.870*     0.520*       0.882*     0.537*
Spatial Pyramid (3 levels)   0.872*     0.529*       0.881*     0.538*
SNAK                         0.877*     0.549*       0.886*     0.561*

7.4.4 Results on Reuters-21578 Corpus

The text categorization results on the Reuters-21578 corpus are presented in Table 7.2. The macro-averaged and the micro-averaged $F_1$ measures are reported for two evaluation modes, one that includes unlabeled documents and one that excludes them. The standard word histogram representation is compared with two approaches
that incorporate spatial information into the bag-of-words model. In both evaluation modes, the spatial pyramids and the SNAK framework improve performance over the standard bag-of-words.

When the unlabeled documents are included in the experiment, the spatial pyramid based on three levels works slightly better than the spatial pyramid based on two levels. For example, the macro-averaged $F_1$ measure grows by nearly 1% (from 0.511 to 0.520) when using the spatial pyramid based on two levels, and by nearly 2% (from 0.511 to 0.529) when using the spatial pyramid based on three levels. The improvements in terms of the micro-averaged $F_1$ measure are smaller. Interestingly, the SNAK approach attains better performance than both spatial pyramids. Its results are even better than the spatial pyramid based on three levels, which requires twice the space. The macro-averaged $F_1$ score given by SNAK is almost 4% better than the standard bag-of-words and 2% better than the spatial pyramid based on three levels. The micro-averaged $F_1$ score given by SNAK is also roughly 1% better than the standard representation and 0.5% better than the spatial pyramid based on three levels.

The relative improvements provided by spatial pyramids and by SNAK are about the same when the unlabeled documents are excluded from the experiment. The only difference is that the spatial pyramid based on two levels gives almost identical results to the spatial pyramid based on three levels. Even so, they are both able to improve performance by a significant margin. More precisely, they attain macro-averaged $F_1$ scores that are nearly 1.5% over the word histogram, and micro-averaged $F_1$ scores that are nearly 0.7% over the word histogram. It is important to note that all methods give better results when the unlabeled documents are excluded, since the task becomes a little easier (the classifier has a lower chance of making a mistake). As in the previous evaluation mode, the SNAK framework yields better performance
than both spatial pyramids. Moreover, the improvements of the SNAK approach over the other representations are consistent in both evaluation modes.

7.4.5 Results on 20 Newsgroups

The results obtained on the 20 Newsgroups corpus using the 4-fold cross validation procedure are presented in Table 7.3. On this corpus, the micro-averaged $F_1$ and macro-averaged $F_1$ measures are almost identical for each individual method. The KRR based on the standard bag-of-words model already achieves a good performance, as it reaches a micro-averaged $F_1$ of 0.941 and a macro-averaged $F_1$ of 0.940. The spatial pyramid based on two levels yields a performance increase of almost 0.5%. However, the performance gain seems to saturate for the spatial pyramid based on three levels, since it attains identical performance to the spatial pyramid based on two levels. The SNAK framework is able to attain the best performance on the 20 Newsgroups corpus, as it reaches a macro-averaged $F_1$ of 0.948 and a micro-averaged $F_1$ of 0.948. Remarkably, the results on the 20 Newsgroups corpus are consistent with the results reported on the Reuters-21578 corpus.

Table 7.3: Empirical results on the 20 Newsgroups corpus obtained by the standard bag-of-words versus two methods that encode spatial information, namely spatial pyramids and SNAK. The macro-averaged and the micro-averaged $F_1$ measures are reported for the 4-fold cross validation procedure. The learning is always done by KRR. The best score in each column is obtained by SNAK. The marker * indicates that the performance is significantly better than the baseline according to a Student's t-test performed at a significance level of 0.01.

Representation               microF1    macroF1
Histogram                    0.941      0.940
Spatial Pyramid (2 levels)   0.945*     0.945*
Spatial Pyramid (3 levels)   0.945*     0.945*
SNAK                         0.948*     0.948*

7.5 Discussion

Two methods for including spatial information into the widely used bag-of-words model have been described in this chapter. Both of them are inspired by research in computer vision. The spatial pyramid is
perhaps the best known method for including spatial information in the bag-of-visual-words. The SNAK framework is a recent development of [Ionescu & Popescu, 2015] that exhibits better performance in object recognition from images than the popular spatial pyramid, while being more compact in terms of space. In this chapter, the spatial pyramid and the SNAK framework have been adapted to text data. Moreover, these two frameworks have been used for the first time in a text mining task, namely text categorization by topic. The empirical results on the Reuters-21578 and the 20 Newsgroups data sets indicate that both the spatial pyramid and the SNAK framework can significantly improve performance over the standard bag-of-words. Overall, the best results are obtained by the SNAK framework.

Remarkably, the SNAK framework was also found to work better than the spatial pyramid in computer vision, according to [Ionescu & Popescu, 2015]. Thus, an interesting pattern seems to take shape. More precisely, it appears that the SNAK framework gives better results regardless of the data type, image or text. Given that it is more compact than a spatial pyramid based on three levels, it should probably be preferred over the spatial pyramid. However, it is important to mention that spatial information should not be expected to improve performance in every text mining task. For example, computer vision researchers have found that spatial pyramids are not useful in texture classification from images. The spatial pyramid recovers some information about the location of objects in images, but this information is useless in texture analysis, since the patterns that form a certain type of texture are uniform across the entire area covered by the respective texture. In a similar way, spatial information may be found to be worthless in some specific text mining tasks. Hence, the proposed approaches should be evaluated on each individual task before drawing a general conclusion.
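To make the SNAK representation for text from Section 7.3.2 more concrete, a minimal Python sketch is given next. It is only an illustration under stated assumptions, not the implementation used in the experiments: the function names `snak_features` and `snak_kernel` are hypothetical, the logarithm is taken in base 10 (which matches the value 0.4771 in Example 8), a term that occurs only once is assigned a standard deviation of 0 (the text does not cover this case), and stop word removal and stemming are skipped so that positions match Example 8.

```python
import math
from collections import defaultdict


def snak_features(tokens):
    """Map each term to a triplet (tf, mean position, std of positions)."""
    n = len(tokens)
    positions = defaultdict(list)
    for pos, term in enumerate(tokens, start=1):  # 1-based positions, as in Example 8
        positions[term].append(pos)

    features = {}
    for term, pos_list in positions.items():
        count = len(pos_list)
        tf = 1.0 + math.log10(count)  # log-normalized term frequency
        mean = sum(pos_list) / count
        # Sample standard deviation (divisor count - 1), as in Example 8.
        # Assumption: a term that occurs only once gets a deviation of 0.
        if count > 1:
            var = sum((p - mean) ** 2 for p in pos_list) / (count - 1)
        else:
            var = 0.0
        # Both statistics are normalized by the document length.
        features[term] = (tf, mean / n, math.sqrt(var) / n)
    return features


def snak_kernel(u, v, c1=0.5, c2=0.5):
    """Equation (7.3) with a linear kernel on the term frequencies."""
    total = 0.0
    # Terms missing from either document have infinite deltas,
    # so they contribute zero and can simply be skipped.
    for term in u.keys() & v.keys():
        tf_u, m_u, s_u = u[term]
        tf_v, m_v, s_v = v[term]
        delta_mean = (m_u - m_v) ** 2
        delta_std = (s_u - s_v) ** 2
        total += tf_u * tf_v * math.exp(-c1 * delta_mean) * math.exp(-c2 * delta_std)
    return total


doc = "He lives in a big house with a big garage in a big city".split()
u = snak_features(doc)
print(tuple(round(x, 4) for x in u["big"]))
```

For the document of Example 8, the triplet printed for the term "big" agrees with the values derived there. Storing the triplets in a dictionary keyed by term, rather than in a dense vector over the whole vocabulary, is a design choice of this sketch that keeps the kernel computation proportional to the number of shared terms.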
References

Bekkerman, Ron, El-Yaniv, Ran, Tishby, Naftali, and Winter, Yoad. Distributional Word Clusters vs. Words for Text Categorization. Journal of Machine Learning Research, 3:1183–1208, March 2003. ISSN 1532-4435.

Dinu, Liviu P., Ionescu, Radu Tudor, and Popescu, Marius. Local Patch Dissimilarity for Images. In Proceedings of ICONIP, volume 7663, pp. 117–126. LNCS Springer-Verlag, 2012.

Grauman, Kristen and Darrell, Trevor. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In Proceedings of ICCV, volume 2, pp. 1458–1465, 2005.

Ionescu, Radu Tudor. Local Rank Distance. In Proceedings of SYNASC, pp. 221–228, Timisoara, Romania, 2013. IEEE Computer Society.

Ionescu, Radu Tudor and Popescu, Marius. Have a SNAK. Encoding Spatial Information with the Spatial Non-alignment Kernel. In Proceedings of ICIAP, volume 9279, pp. 97–108. Springer LNCS, 2015.

Joachims, Thorsten. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of ECML, pp. 137–142, London, UK, 1998. Springer-Verlag.

Johnson, Rie and Zhang, Tong. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In Proceedings of NAACL, pp. 103–112, 2015.

Koniusz, Piotr and Mikolajczyk, Krystian. Spatial coordinate coding to reduce histogram representations, dominant angle and colour pyramid match. In Proceedings of ICIP, pp. 661–664. IEEE, 2011.

Krapac, Josip, Verbeek, Jakob, and Jurie, Frederic. Modeling Spatial Layout with Fisher Vectors for Image Categorization. In Proceedings of ICCV, pp. 1487–1494. IEEE, November 2011.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS, pp. 1106–1114, 2012.

Lang, Ken. NewsWeeder: Learning to Filter Netnews. In
Proceedings of ICML, pp. 331–339, 1995.

Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of CVPR, volume 2, pp. 2169–2178, Washington, DC, USA, 2006. IEEE Computer Society.

Lewis, David. The reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578/, 1997.

Manning, Christopher D. and Schutze, Hinrich. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999.

Manning, Christopher D., Raghavan, Prabhakar, and Schutze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

Pang, Bo, Lee, Lillian, and Vaithyanathan, Shivakumar. Thumbs Up? Sentiment Classification Using Machine Learning Techniques. In Proceedings of EMNLP, volume 10, pp. 79–86, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

Porter, Martin F. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

Pu, Wen, Liu, Ning, Yan, Shuicheng, Yan, Jun, Xie, Kunqing, and Chen, Zheng. Local Word Bag Model for Text Categorization. In Proceedings of ICDM, pp. 625–630, Los Alamitos, CA, USA, 2007. IEEE Computer Society.

Sanchez, Jorge, Perronnin, Florent, and de Campos, Teofilo. Modeling the spatial layout of images beyond spatial pyramids. Pattern Recognition Letters, 33(16):2216–2223, 2012. ISSN 0167-8655.

Schapire, Robert E. and Singer, Yoram. BoosTexter: A Boosting-based System for Text Categorization. Machine Learning, 39(2):135–168, May 2000. ISSN 0885-6125.

Sebastiani, Fabrizio. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, March 2002.

Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale
Image Recognition. In Proceedings of ICLR, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going Deeper With Convolutions. In Proceedings of CVPR, pp. 1–9, June 2015.

Szeliski, Richard. Computer Vision: Algorithms and Applications. Springer-Verlag New York, Inc., New York, NY, USA, 1st edition, 2010.

Tan, Chade-Meng, Wang, Yuan-Fang, and Lee, Chan-Do. The use of bigrams to enhance text categorization. Information Processing & Management, 38(4):529–546, 2002.

Uijlings, J. R. R., Smeulders, A. W. M., and Scha, R. J. H. What is the Spatial Extent of an Object? In Proceedings of CVPR, pp. 770–777, 2009.

Xue, Xiao-Bing and Zhou, Zhi-Hua. Distributional features for text categorization. IEEE Transactions on Knowledge and Data Engineering, 21(3):428–442, March 2009.

Yang, Yiming and Liu, Xin. A re-examination of text categorization methods. In Proceedings of SIGIR, pp. 42–49, New York, NY, USA, 1999. ACM.

Chapter 8

Text Classification using Bag-of-Super-Word-Embeddings

Abstract

In this chapter, we present a novel approach for text classification based on clustering word embeddings, inspired by the bag-of-visual-words model, which is widely known in computer vision. After each word in a collection of documents is represented as a word vector using a pre-trained word embeddings model, a k-means algorithm is applied on the word vectors in order to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a super word embedding that embodies all the semantically related word vectors in a certain region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid. In the end, each document is represented as a bag-of-super-word-embeddings by computing
the frequency of each super word embedding in the respective document. We also diverge from the idea of building a single vocabulary for the entire collection of documents, and propose to build class-specific vocabularies for better performance. Using the proposed representation, we report results on three text mining tasks, namely text categorization by topic, polarity classification and automatic essay scoring. On the first two tasks, our model yields better performance than the standard bag-of-words. On the third task, automatic essay scoring, we combine the bag-of-super-word-embeddings with string kernels and we report the best performance on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings, surpassing recent state-of-the-art deep learning approaches.

8.1 Introduction

With the recent exponential growth of the Internet, there is more and more data that requires efficient processing methods for storing and extracting relevant information. This data is usually unstructured or semi-structured, and comes in different forms, such as images or texts. In order to process larger and larger amounts of data, researchers need to develop new techniques that can extract relevant information and infer some kind of structure from the available data. In text processing, implementing a simple bag-of-words (BOW) model to represent a collection of documents can prove to be useful in tasks such as sentiment analysis [Pang et al., 2002], text categorization [Joachims, 1998] or information retrieval [Manning et al., 2008]. On the other hand, in order to process images, one should first find salient features before extracting them. The features can either be determined by experts in the specific domain of the application, or by a technique termed representation learning, where the features are discovered automatically [Bengio, 2009; Krizhevsky et al., 2012; Montavon et al., 2012]. Like text documents, images can be represented using the bag-of-words
model, but a word has a completely different meaning and interpretation than in string and text processing. In fact, in computer vision, this model is known as the bag-of-visual-words (BOVW) [Csurka et al., 2004; Sivic et al., 2005; Zhang et al., 2007], and a visual word is usually defined as a cluster of similar image descriptors [Bay et al., 2008; Dalal & Triggs, 2005; Lowe, 1999, 2004] extracted from the images.

In recent years, researchers have developed effective ways [Mikolov et al., 2013] for representing words as vectors. Word embeddings [Bengio et al., 2003; Collobert & Weston, 2008; Mikolov et al., 2013] have had a huge impact in natural language processing (NLP) and related fields, being used in many tasks including information retrieval [Clinchant & Perronnin, 2013; Ye et al., 2016], sentiment analysis [Dos Santos & Gatti, 2014] and word sense disambiguation [Bhingardive et al., 2015; Butnaru et al., 2017; Chen et al., 2014; Iacobacci et al., 2016], among many others. In this chapter, we consider word embeddings from a different perspective by drawing our inspiration from computer vision. Our aim is to redesign an efficient computer vision technique and use it for natural language processing tasks by leveraging the use of word embeddings. More specifically, we interpret word embeddings as text descriptors, which allows us to adapt computer vision techniques based on local image descriptors such as SIFT [Lowe, 1999, 2004] or SURF [Bay et al., 2008]. In computer vision, a local image descriptor is a visual unit that represents a small image region by its elementary characteristics such as shape, color or texture. In natural language processing, word embeddings capture the semantic similarities between linguistic items, and a text descriptor (word vector) is a textual unit that represents a word by its semantic characteristics. Based on this analogy, we propose a novel approach for text classification inspired by the bag-of-visual-words model. Our approach is different
from the standard bag-of-words model used in natural language processing. Instead of using a vocabulary of words, we build a vocabulary of super word vectors by clustering word vectors with k-means. Hence, a document will be represented as a bag-of-super-word-embeddings (BOSWE) [Butnaru & Ionescu, 2017]. We also diverge from the idea of building a single vocabulary for the entire collection of documents, and propose to build a set of class-specific vocabularies of super word vectors, by using the class labels earlier in the training process, in order to separate the samples before applying the k-means clustering. This latter approach seems to give better results in practice. In this chapter, we also approach a regression problem, namely automatic essay scoring, for which it is impossible to define class-specific vocabularies. In the learning stage, we employ kernel methods. We try out several kernels, such as the linear kernel, the intersection kernel, Hellinger's kernel, the Jensen-Shannon kernel, and the relatively new PQ kernel [Ionescu & Popescu, 2013, 2015b].

We conduct several experiments on text categorization by topic and polarity classification to demonstrate the effectiveness of our representation compared to a standard bag-of-words. We also combine BOSWE with string kernels and approach a regression task, namely automatic essay scoring (AES). Automatic essay scoring is the task of assigning grades to essays written in an educational setting, using a computer-based system with natural language processing capabilities. The aim of designing such systems is to reduce the involvement of human graders as far as possible. AES is a challenging task as it relies on grammar as well as semantics, pragmatics and discourse [Song et al., 2017]. Although traditional AES methods typically rely on handcrafted features [Attali & Burstein, 2006; Chen & He, 2013; Dikli, 2006; Foltz et al., 1999; Larkey, 1998; Phandi et al., 2015; Somasundaran et al., 2014; Wang & Brown, 2008;
Yannakoudakis et al., 2014], recent results indicate that state-of-the-art deep learning methods reach better performance [Alikaniotis et al., 2016; Dong & Zhang, 2016; Dong et al., 2017; Song et al., 2017; Taghipour & Ng, 2016; Tay et al., 2018], perhaps because these methods are able to capture subtle and complex information that is relevant to the task [Dong & Zhang, 2016]. Since recent methods based on string kernels have demonstrated remarkable performance in various text classification tasks ranging from authorship identification [Popescu & Grozea, 2012] and sentiment analysis [Gimenez-Perez et al., 2017; Popescu et al., 2017] to native language identification [Ionescu et al., 2014b, 2016] and dialect identification [Ionescu & Butnaru, 2017; Ionescu & Popescu, 2016], we believe that string kernels can reach equally good results in AES. As string kernels are a simple approach that relies solely on character n-grams as features, it is fairly obvious that such an approach will not cover several aspects (e.g., semantics, discourse) required for the AES task. To solve this problem, we propose to combine string kernels with our approach based on word embeddings, as shown in [Cozma et al., 2018]. To our knowledge, this is the first successful attempt to combine string kernels and word embeddings. We evaluate our approach on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings. The empirical results indicate that our approach yields a better performance than several state-of-the-art approaches [Dong & Zhang, 2016; Dong et al., 2017; Phandi et al., 2015; Tay et al., 2018].

The rest of this chapter is organized as follows. Section 8.2 presents related work from computer vision and natural language processing. The bag-of-super-word-embeddings is presented in Section 8.3. Using BOSWE, we present experiments on polarity classification in Section 8.4 and on text categorization by topic in Section 8.5. Using BOSWE and string kernels, we
present experiments on automatic essay scoring in Section 8.6. Finally, we draw our conclusions in Section 8.7.

8.2 Related Work

8.2.1 Bag-of-Visual-Words

Despite the traditional view that computer vision and text processing are separate and unrelated fields of study, there are many cases in which text and images can be treated in a similar manner, as detailed in Chapter 1. One such example is the bag-of-words representation. The bag-of-words model represents a text as an unordered collection of words, completely disregarding grammar, word order, and syntactic groups. It has many applications from information retrieval [Manning et al., 2008] to natural language processing [Manning & Schutze, 1999] and word sense disambiguation [Agirre & Edmonds, 2006]. In order to use a bag-of-words representation for computer vision tasks, researchers have introduced the concept of visual word by vector quantizing local image descriptors, such as SIFT [Lowe, 1999, 2004] or SURF [Bay et al., 2008]. The vector quantization process can be done, for example, by k-means clustering [Leung & Malik, 2001] or by probabilistic Latent Semantic Analysis [Sivic et al., 2005]. The frequency of each visual word is then recorded in a histogram which represents the final feature vector for the image. This histogram is the equivalent of the bag-of-words representation for text. The idea of representing images as bags-of-visual-words has demonstrated impressive levels of performance for image categorization [Zhang et al., 2007], image retrieval [Philbin et al., 2007], facial expression recognition [Ionescu et al., 2013] and related tasks.

8.2.2 Word Embeddings

Because of the success of the bag-of-visual-words model in image classification, we propose a similar approach on text, by replacing the local image descriptors with word embeddings. Word embeddings have long been known in the NLP community [Bengio et al., 2003; Collobert & Weston, 2008], but they have recently become more popular due to the word2vec
[Mikolov et al., 2013] framework that allows vector representations of words to be built efficiently. Word embeddings represent each word as a low-dimensional real-valued vector, such that semantically related words reside in close vicinity in the generated space. Word embeddings are in fact a learned distributed representation of words where each dimension represents a latent feature of the word [Turian et al., 2010]. Using the word representation induced by the embedding space, documents can be represented as a set of word vectors, where the size of this set is given by the number of words in the document. Given the fact that two documents are likely to be represented by sets of different sizes, the comparison between the respective documents cannot be done directly. To overcome this issue, Le et al. [Le & Mikolov, 2014] proposed the Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Their algorithm represents each document by a dense vector which is trained to predict words in the document.

With some inspiration from computer vision, an alternative approach to solve the issue of variable-length representations is proposed by Clinchant et al. [Clinchant & Perronnin, 2013]. Following the success of Fisher Vectors in computer vision [Perronnin et al., 2010], Clinchant et al. [Clinchant & Perronnin, 2013] apply the Fisher Kernel framework [Jaakkola et al., 1999] to aggregate the word embeddings of a document in order to obtain a fixed-length vector representation for the respective document. We propose a different approach that also has its roots in computer vision research [Csurka et al., 2004; Leung & Malik, 2001; Philbin et al., 2007]. Our approach employs the k-means clustering algorithm in order to group the word embeddings into a fixed number of clusters according to their semantic relatedness. We regard the resulting cluster centroids as visual
words and process them accordingly, in order to obtain a histogram representation for each document.

Word embeddings have also been used in information retrieval [Clinchant & Perronnin, 2013; Ye et al., 2016] and in word sense disambiguation [Bhingardive et al., 2015; Butnaru et al., 2017; Chen et al., 2014; Iacobacci et al., 2016] due to their ability to model syntactic and semantic information. Another useful characteristic of word embeddings is that one can specifically train them to capture sentiment information in order to detect the polarity of documents [Dos Santos & Gatti, 2014; Le & Mikolov, 2014].

8.3 Bag-of-Super-Word-Embeddings

In computer vision, the bag-of-visual-words (BOVW) model can be applied to image classification and related tasks, by treating image descriptors as words. A bag-of-visual-words is a vector of occurrence counts of a vocabulary of local image features. This representation can also be described as a histogram of visual words. The vocabulary is usually obtained by vector quantizing image features into visual words.

Inspired by the BOVW model, we propose a similar way to process text documents by leveraging word embeddings. In our approach designed for text, the image descriptors are replaced by word embeddings. Knowing that word embeddings carry semantic information by projecting semantically related words in the same region of the embedding space, we propose to cluster word vectors in order to obtain relevant semantic clusters of words. Each centroid of the newly formed clusters can be regarded as a super word vector that represents all the word vectors in a small region of the embedding space. By putting the super word vectors together, we obtain a vocabulary that we subsequently use to describe each document as a histogram of super word embeddings. We term this model bag-of-super-word-embeddings (BOSWE) [Butnaru & Ionescu, 2017].

The BOSWE model can be divided into two major steps. The first step is to build a feature representation. The
second step is to train a kernel method in order to predict the class label of a new document. Each of these two steps is carried out in two stages, one for training (usually done offline) and one for testing (usually executed online). The entire process, which involves both training and testing stages, is illustrated in Figure 8.1.

The feature representation step works as described next. Features are represented by the word vectors obtained by embedding all the words in the text. Next, the word embeddings are vector quantized and a vocabulary of super word embeddings is obtained. The vector quantization process is done by k-means clustering [Leung & Malik, 2001], and the formed centroids are stored in a randomized forest of k-d trees [Philbin et al., 2007] to reduce search cost. Although alternative clustering approaches have been proposed in the computer vision literature [Martinet, 2014; Sivic et al., 2005], k-means remains the most popular choice for the vector quantization step. This is the main reason for using k-means in our framework. The resulting centroids can hold some high-level abstract definition of a concept. Every word in the text is assigned to the closest centroid based on the Euclidean distance measure. The frequency of each super word embedding is then computed and recorded in a histogram.

Figure 8.1: The BOSWE model for text classification. Words are embedded into a vector space and quantized into super word vectors. The frequency of each super word vector is then recorded in a histogram. The histograms enter the training stage. Learning is done by a kernel method.

We propose two alternative pipelines for building the feature representation. In the first pipeline, we process the entire collection of documents all at once in order to build a single vocabulary of super word vectors. However, in this approach, words representing different classes can be clustered together due to their semantic relatedness. The second pipeline aims to overcome this
issue by grouping the training text documents into classes, and by processing each group of documents separately. This leads to a set of class-specific vocabularies of super word embeddings. Even if we know that a document belongs to a certain class at training time, we cannot rely on this assumption at test time. Therefore, we have to build the feature representation of each document by concatenating all the histograms corresponding to the class-specific vocabularies. For both pipelines, we then consider the feature vectors corresponding to the entire set of documents for the training step.

Typically, a kernel method is employed for training the model. In computer vision, several kernels have been used at this stage. The linear kernel, the intersection kernel, the Hellinger's kernel or the Jensen-Shannon (JS) kernel are typical choices from the literature [Vedaldi & Zisserman, 2010]. Another option is the recently developed PQ kernel [Ionescu & Popescu, 2013, 2015b]. The underlying idea of the PQ kernel is to treat the super word vector histograms as ordinal data, in which data is ordered but cannot be assumed to have equal distance between values. In this case, a histogram will be regarded as a ranking of super word vectors according to their frequencies in that histogram. Using the ranking of super word vectors instead of the actual values of the frequencies may seem like a loss of information, but the process of ranking can actually make the PQ kernel more robust, acting as a filter and eliminating the noise contained in the values of the frequencies. As for object recognition [Ionescu & Popescu, 2013, 2015b] or texture classification in images [Ionescu et al., 2014a], we show that this kernel can yield better results than the other kernels. We try out the above mentioned kernel functions in combination with the Support Vector Machines (SVM) classifier [Cortes & Vapnik, 1995; Shawe-Taylor & Cristianini, 2004].

After our model is trained, it can be used to classify new documents.
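The feature representation step and two of the kernels mentioned above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the centroids are assumed to have been produced by k-means beforehand, nearest-centroid search is done by brute force rather than with a forest of k-d trees, and all function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from collections import Counter


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(vector, centroids[i]))


def boswe_histogram(word_vectors, centroids):
    """Count how many word vectors of a document fall into each cluster."""
    counts = Counter(nearest_centroid(v, centroids) for v in word_vectors)
    return [counts[i] for i in range(len(centroids))]


def class_specific_histogram(word_vectors, vocabularies):
    """Second pipeline: concatenate one histogram per class-specific vocabulary."""
    features = []
    for centroids in vocabularies:
        features.extend(boswe_histogram(word_vectors, centroids))
    return features


def l1_normalize(hist):
    """Scale a histogram so that its entries sum to one."""
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else list(hist)


def intersection_kernel(x, y):
    """Histogram intersection: sum of component-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def hellinger_kernel(x, y):
    """Hellinger's kernel: sum of square roots of component-wise products."""
    return sum(math.sqrt(a * b) for a, b in zip(x, y))


# Toy example (assumed data): two class-specific vocabularies of two
# super word vectors each, and two documents of different lengths.
vocabularies = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)]]
doc_a = [(0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
doc_b = [(0.0, 0.1), (0.2, 0.1)]

hist_a = l1_normalize(class_specific_histogram(doc_a, vocabularies))
hist_b = l1_normalize(class_specific_histogram(doc_b, vocabularies))
print(len(hist_a), len(hist_b))
print(intersection_kernel(hist_a, hist_b), hellinger_kernel(hist_a, hist_b))
```

Both documents end up with feature vectors of the same length, regardless of how many words they contain, which is what makes the subsequent kernel comparison possible.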
Given a test document, features are extracted and quantized into centroids from the vocabulary (or the multiple class-specific vocabularies) already obtained in the training stage. The histogram of super word embeddings that represents the test document can be compared with the histograms learned in the training stage. The system can return either a label (or a score) for the test document or a ranked list of documents similar to the test document, depending on the application. For text classification a label (or a score) is enough, while for information retrieval a ranked list of documents is more appropriate. No matter the application, the training stage of the BOSWE model can be done offline. For this reason, the time that is necessary for vector quantization and learning is not of great importance. What matters most in the context of text classification is to return the result for a new (test) document as quickly as possible.

The performance level of the described model depends on the number of training documents, but also on the number of clusters. The number of clusters k is a parameter of the model that must be set a priori. In computer vision, there is a common practice to use larger vocabularies for improved performance [Ionescu & Popescu, 2015b; Ionescu et al., 2013]; however, there is a point where the accuracy saturates and the only effect of further increasing k is to unnecessarily slow down the computation.

8.3.1 Implementation Details

We next provide some implementation details for our BOSWE model used throughout the experiments. In the feature representation step, we have used the pre-trained word embeddings computed by the word2vec toolkit [Mikolov et al., 2013] on the Google News data set using the Skip-gram model. The pre-trained model contains 300-dimensional vectors for 3 million words and phrases. Most of the steps involved in the BOSWE model, such as the k-means clustering and the randomized forest of k-d trees, are implemented
using the VLFeat library [Vedaldi & Fulkerson, 2008]. After computing the histograms, we apply one of the following kernels: the L2-normalized linear kernel, the L1-normalized Hellinger's kernel, the L1-normalized intersection kernel, the L1-normalized Jensen-Shannon (JS) kernel, and the L2-normalized PQ kernel. The norms are chosen according to Vedaldi et al. [Vedaldi & Zisserman, 2010], who state that $\gamma$-homogeneous kernels should be $L_\gamma$-normalized. We use the software provided at http://pq-kernel.herokuapp.com to compute the PQ kernel. It is important to mention that all these kernels are used in the dual form, which implies using the kernel trick [Shawe-Taylor & Cristianini, 2004] to directly build kernel matrices of pairwise similarities between samples. In the learning stage, we use the dual implementation of the Support Vector Machines classifier provided in LibSVM [Chang & Lin, 2011].

8.3.2 Combination with String Kernels

In text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams. Various string kernel functions have been proposed to date [Ionescu et al., 2014b; Lodhi et al., 2002; Shawe-Taylor & Cristianini, 2004]. One of the most recent string kernels is the histogram intersection string kernel (HISK) [Ionescu et al., 2014b]. For two strings over an alphabet $\Sigma$, $x, y \in \Sigma^*$, the intersection string kernel is formally defined as follows:

$$k^{\cap}(x, y) = \sum_{v \in \Sigma^n} \min\{\mathrm{num}_v(x), \mathrm{num}_v(y)\}, \qquad (8.1)$$

where $\mathrm{num}_v(x)$ is the number of occurrences of n-gram $v$ as a substring in $x$, and $n$ is the length of $v$. In our AES experiments, we use the intersection string kernel based on a range of character n-grams. We approach AES as a regression task, and employ ν-Support Vector Regression (ν-SVR) [Chang & Lin, 2002] for training.

We combine HISK and BOSWE in the dual (kernel) form, by simply summing up the two corresponding kernel matrices. Summing up kernel matrices is equivalent to feature vector concatenation in the primal space. This means
that the two approaches are fused before the learning stage. As a consequence of kernel summation, the search space of linear patterns grows, which should help the kernel method, in our case ν-SVR, to find a better regression function.

8.4 Polarity Classification Experiments

8.4.1 Data Set

The first corpus used to evaluate the proposed model is the Movie Review data set [Pang et al., 2002]. This is probably the most popular corpus used for sentiment analysis. The Movie Review data set consists of 2000 movie reviews taken from the IMDB movie review archives. There are 1000 positive reviews consisting of four and five star reviews, and 1000 negative ones consisting of one and two star reviews. We use a 10-fold cross-validation procedure in the evaluation.

8.4.2 Baselines

We compare our model against a baseline bag-of-words. We considered the following steps to obtain a bag-of-words representation suited for the polarity categorization task. First of all, the text is broken down into tokens. After applying the tokenization process, the next step is to eliminate the stop words (see footnote 1). The remaining terms from the entire collection of documents are gathered into a vocabulary. The frequency of each term is then computed on a per document basis. The frequency histograms are normalized using the L2-norm. As in our own approach, we use SVM for training. We also consider the approach of Pang et al. [Pang et al., 2002], an alternative implementation of the bag-of-words model, as baseline.

8.4.3 Results

Table 8.1 presents the accuracy rates of various BOSWE models obtained in a 10-fold cross-validation procedure carried out on the Movie Review data set, by combining different vocabulary dimensions and kernels. The results presented in Table 8.1 indicate that building a vocabulary for each polarity class (positive and negative) is a better approach than building a single vocabulary for the entire training set. This observation holds for every kernel considered in the evaluation. Interestingly,
among the evaluated kernels, we obtain better performance with the Hellinger's and the PQ kernels. For every vocabulary dimension, the PQ kernel yields the best results. The best performance (88.95%) is obtained when the BOSWE model relies on two vocabularies, each of 7500 super word vectors, and on the PQ kernel. Remarkably, these results are somewhat consistent with the results reported in [Ionescu & Popescu, 2013, 2015b] in the context of object recognition from images. Indeed, previous works [Ionescu & Popescu, 2013, 2015b] have also found that using more visual words and applying the PQ kernel leads to better performance.

Footnote 1: Stop words are the most common words in a language, usually function words, such as this, is, it.

Table 8.1: Accuracy rates using 10-fold cross-validation on the Movie Review data set with different kernels and vocabulary dimensions. The best accuracy rate for each vocabulary dimension is highlighted in bold.

| Vocabulary | Linear (L2-norm) | Hellinger's (L1-norm) | Intersection (L1-norm) | JS (L1-norm) | PQ (L2-norm) |
|---|---|---|---|---|---|
| 1 × 5000 words | 84.80% | 86.15% | 85.40% | 85.80% | **86.55%** |
| 1 × 10000 words | 85.05% | 86.45% | 85.75% | 86.10% | **87.15%** |
| 2 × 5000 words | 85.75% | 87.60% | 86.95% | 87.35% | **88.25%** |
| 2 × 7500 words | 87.15% | 88.60% | 88.15% | 87.80% | **88.95%** |

Table 8.2: Accuracy rates using 10-fold cross-validation on the Movie Review data set with various BOSWE configurations versus two baseline approaches. The best accuracy rate is highlighted in bold.

| Method | Accuracy |
|---|---|
| Baseline BOW | 84.10% |
| Pang et al. [Pang et al., 2002] | 82.90% |
| BOSWE (2 × 7500 words and Hellinger's kernel) | 88.60% |
| BOSWE (2 × 7500 words and PQ kernel) | 88.95% |
| BOSWE (2 × 7500 words and Hellinger's kernel + PQ kernel) | **89.65%** |

We compare our best BOSWE configurations with two baseline approaches in Table 8.2. We also try to combine the Hellinger's and the PQ kernels by summing them up, in order to improve the performance. The results indicate that all our BOSWE configurations achieve better performance than the baseline approaches. The best BOSWE configuration yields an accuracy of 89.65%.
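The kernel combination reported above is a plain sum of two kernels. As a rough illustration of why this amounts to fusing two feature representations, the sketch below checks on toy histograms that the sum of two kernel values equals the inner product of the concatenated explicit feature maps. It is an assumption-laden example: the linear kernel stands in for the PQ kernel, which is not defined in this section, and the function names and toy vectors are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Standard inner product; its explicit feature map is the identity."""
    return sum(a * b for a, b in zip(x, y))


def hellinger_kernel(x, y):
    """Hellinger's kernel on non-negative histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(x, y))


def hellinger_feature_map(x):
    """Explicit feature map of Hellinger's kernel: element-wise square root."""
    return [math.sqrt(a) for a in x]


def summed_kernel(x, y):
    """Dual-form fusion: add the two kernel values."""
    return hellinger_kernel(x, y) + linear_kernel(x, y)


def concatenated_features(x):
    """Primal-form fusion: concatenate the two explicit feature maps."""
    return hellinger_feature_map(x) + list(x)


x = [0.5, 0.25, 0.25]
y = [0.1, 0.6, 0.3]
dual = summed_kernel(x, y)
primal = linear_kernel(concatenated_features(x), concatenated_features(y))
print(abs(dual - primal) < 1e-9)
```

Working in the dual form is convenient when one of the kernels has no simple explicit feature map, since only the kernel matrices need to be added.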
Our best approach is more than 5% better than the baseline BOW and more than 6% better than the baseline approach of Pang et al. [Pang et al., 2002]. We thus conclude that the BOSWE model is capable of improving the performance over a standard BOW model for the polarity classification task.

8.5 Text Categorization Experiments

8.5.1 Data Set

The Reuters-21578 corpus [Lewis, 1997] is one of the most widely used test collections for text categorization research. It contains 21578 articles collected from Reuters newswire. Following the procedure of Joachims [Joachims, 1998] and Yang et al. [Yang & Liu, 1999], the categories that have at least one document in the training set and one in the test set are selected. This leads to a total of 90 categories. We use the ModApte evaluation [Xue & Zhou, 2009], in which unlabeled documents are eliminated. After removing the unlabeled documents, there are 10787 documents left that belong to 90 categories. Each document belongs to one or more categories and the average number of categories per document is 1.2. The collection is split into 7768 documents in the training set and 3019 documents in the test set.

8.5.2 Baseline

We compare our BOSWE model with a bag-of-words baseline adapted specifically to text categorization by topic. The following steps are required to obtain a bag-of-words representation suited for the text categorization task. The text is first broken down into tokens. After tokenization, the following step is to eliminate the stop words, as they do not provide useful information in the context of text categorization by topic. The remaining words are stemmed using the Porter stemmer algorithm [Porter, 1980] (see footnote 1). This algorithm removes the commoner morphological and inflexional endings from words in English. The resulting terms from the entire collection of documents are collected into a vocabulary. The frequency of each term is then computed on a per document basis. Let $f_{t,d}$ denote the raw frequency of a term $t$ in a document $d$, namely the number of times $t$ occurs in $d$. The bag-of-words representation used as baseline in the following experiments is obtained by computing the log-normalized term frequency as follows:

$$tf(t, d) = \begin{cases} 1 + \log f_{t,d}, & \text{if } f_{t,d} > 0 \\ 0, & \text{if } f_{t,d} = 0 \end{cases} \qquad (8.2)$$

Footnote 1: Stemming is the process that reduces a word to its root form.

Table 8.3: Confusion matrix of a binary classifier with labels +1 or −1. There are four distinct groups of samples illustrated here: true positive (TP), false positive (FP), false negative (FN), and true negative (TN).

| | Expert judgment +1 | Expert judgment −1 |
|---|---|---|
| Classifier prediction +1 | TP | FP |
| Classifier prediction −1 | FN | TN |

8.5.3 Evaluation Procedure

To evaluate and compare the text categorization approaches, the precision and the recall are first computed based on the confusion matrix presented in Table 8.3. The precision is given by the number of true positive documents (TP) divided by the number of documents predicted as positive by the classifier (TP + FP), while the recall is given by the number of true positive documents (TP) divided by the total number of documents marked as positive by a trusted expert judge (TP + FN). To capture the precision and recall into a single representative number, the F1 measure can be employed. The F1 measure is the harmonic mean of the precision and recall, given by:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

For each category, a binary classifier is trained to predict the positive and negative labels for the test documents. However, the performance of the classifier needs to be evaluated at the global level (over all categories). Two approaches are used in the literature to aggregate the F1 measures over multiple categories. One is based on computing a confusion matrix for each category, which can be used to subsequently calculate the F1 measure for each category. Finally, the global F1 measure is obtained by averaging all the F1 measures. This first measure is known as macro-averaged F1 (macroF1). The other approach is based on computing a global confusion matrix for all the categories by summing the documents that fall in each of the four sets, namely true positives, true negatives, false positives, and false negatives. The global F1 measure is immediately computed with the values provided by the global confusion matrix. This second measure is known as micro-averaged F1 (microF1). As noted by Xue et al. [Xue & Zhou, 2009], the classifier's performance on rare categories has more impact on the macro-averaged F1 measure, while the performance on common categories has more impact on the micro-averaged F1 measure. Thus, it makes sense to report both these measures in the following experiments.

Table 8.4: Results on the Reuters-21578 test set with different kernels and vocabulary dimensions. The best microF1 and macroF1 scores for each vocabulary dimension are highlighted in bold.

| Measure | Vocabulary | Linear (L2-norm) | Hellinger's (L1-norm) | Intersection (L1-norm) | JS (L1-norm) | PQ (L2-norm) |
|---|---|---|---|---|---|---|
| microF1 | 1 × 10000 words | 86.62% | 86.56% | 85.28% | 86.30% | **86.74%** |
| microF1 | 1 × 20000 words | 86.72% | 86.61% | 85.66% | 86.35% | **86.80%** |
| microF1 | 90 × 100 words | 86.77% | **86.91%** | 86.25% | 86.59% | 86.84% |
| microF1 | 90 × 200 words | 86.83% | 87.04% | 86.33% | 86.74% | **87.07%** |
| macroF1 | 1 × 10000 words | **49.42%** | 45.21% | 41.19% | 43.30% | 49.31% |
| macroF1 | 1 × 20000 words | **49.58%** | 45.39% | 41.55% | 43.54% | 49.36% |
| macroF1 | 90 × 100 words | **49.63%** | 47.71% | 42.50% | 44.94% | 49.49% |
| macroF1 | 90 × 200 words | **49.68%** | 47.75% | 42.64% | 45.06% | 49.51% |

Table 8.5: Results on the Reuters-21578 test set with various BOSWE configurations versus a baseline bag-of-words model. The best microF1 and macroF1 scores are highlighted in bold.

| Method | microF1 | macroF1 |
|---|---|---|
| Baseline BOW | 86.09% | 49.45% |
| BOSWE (90 × 200 words and linear kernel) | 86.83% | 49.68% |
| BOSWE (90 × 200 words and PQ kernel) | 87.07% | 49.51% |
| BOSWE (90 × 200 words and linear kernel + PQ kernel) | **87.24%** | **49.72%** |

8.5.4 Results

Table 8.4 presents the micro-averaged F1 scores and macro-averaged F1 scores of various BOSWE models obtained on the Reuters-21578 test set, by combining different vocabulary dimensions and kernels. The results presented in Table 8.4 indicate that building a vocabulary for each topic gives slightly better results than building a single
vocabulary for all the 90 topics, even though the topic-specific vocabularies are significantly smaller in size, e.g., 200 words versus 20000 words. Among the evaluated kernels, we obtain better performance with the linear and the PQ kernels. While the PQ kernel yields a better microF1 score, the linear kernel compensates with a better macroF1 score. Nonetheless, the difference between the two kernels is not significant.

We compare our best BOSWE configurations with the baseline bag-of-words in Table 8.5. We again try to combine the best performing kernels by summing them up. Although the results indicate that our BOSWE configurations achieve better performance than the baseline bag-of-words, the differences are not as high as in the polarity classification experiments. Our best BOSWE configuration yields a microF1 score of 87.24% and a macroF1 score of 49.72%, which represents an improvement of 1.15% in terms of microF1 and 0.27% in terms of macroF1 over the baseline. Overall, it seems that the BOSWE model can surpass the performance of a standard bag-of-words representation for text categorization by topic.

8.6 Automatic Essay Scoring Experiments

8.6.1 Data Set

To evaluate our approach, we use the Automated Student Assessment Prize (ASAP) data set from Kaggle. The ASAP data set contains 8 prompts of different genres. The number of essays per prompt along with the score ranges are presented in Table 8.6. Since the official test data of the ASAP competition is not released to the public, we, as well as others before us [Dong & Zhang, 2016; Dong et al., 2017; Phandi et al., 2015; Tay et al., 2018], use only the training data in our experiments.

Table 8.6: The number of essays and the score ranges for the 8 different prompts in the Automated Student Assessment Prize (ASAP) data set.

| Prompt | Number of Essays | Score Range |
|---|---|---|
| 1 | 1783 | 2-12 |
| 2 | 1800 | 1-6 |
| 3 | 1726 | 0-3 |
| 4 | 1726 | 0-3 |
| 5 | 1772 | 0-4 |
| 6 | 1805 | 0-4 |
| 7 | 1569 | 0-30 |
| 8 | 723 | 0-60 |

8.6.2 Evaluation Procedure

As in [Dong & Zhang, 2016], we scaled the essay scores into the range 0-1. We closely
followed the same settings for data preparation as [Dong & Zhang, 2016; Phandi et al., 2015]. For the in-domain experiments, we use 5-fold cross-validation. The 5-fold cross-validation procedure is repeated 10 times and the results are averaged to reduce the accuracy variation introduced by randomly selecting the folds. We note that the standard deviation in all cases is below 0.2%.

For the cross-domain experiments, we use the same source→target domain pairs as [Dong & Zhang, 2016; Phandi et al., 2015], namely, 1→2, 3→4, 5→6 and 7→8. All essays in the source domain are used as training data. Target domain samples are randomly divided into 5 folds, where one fold is used as test data, and the other 4 folds are collected together to sub-sample target domain train data. The sub-sample sizes are $n_t = \{10, 25, 50, 100\}$. The sub-sampling is repeated 5 times as in [Dong & Zhang, 2016; Phandi et al., 2015] to reduce bias. As our approach performs very well in the cross-domain setting, we also present experiments without sub-sampling data from the target domain, i.e., when the sub-sample size is $n_t = 0$. As evaluation metric, we use the quadratic weighted kappa (QWK).

8.6.3 Baselines

We compare our approach with state-of-the-art methods based on handcrafted features [Phandi et al., 2015], as well as deep features [Dong & Zhang, 2016; Dong et al., 2017; Tay et al., 2018]. We note that results for the cross-domain setting are reported only in some of these recent works [Dong & Zhang, 2016; Phandi et al., 2015].

8.6.4 Implementation Choices

For the string kernels approach, we used the histogram intersection string kernel (HISK) based on the blended range of character n-grams from 1 to 15. To compute the intersection string kernel, we used the open-source code provided at http://string-kernels.herokuapp.com. For the BOSWE approach, we set the number of clusters (dimension of the vocabulary) to k = 500. After computing the BOSWE representation, we apply the L1-normalized intersection
kernel. We combine HISK and BOSWE in the dual form by summing up the two corresponding matrices. For the learning phase, we employ the dual implementation of ν-SVR available in LibSVM [Chang & Lin, 2011]. We set its regularization parameter to $c = 10^3$ and $\nu = 10^{-1}$ in all our experiments.

Table 8.7: In-domain automatic essay scoring results of our approach versus several state-of-the-art methods [Dong & Zhang, 2016; Dong et al., 2017; Phandi et al., 2015; Tay et al., 2018]. Results are reported in terms of the quadratic weighted kappa (QWK) measure, using 5-fold cross-validation. The best QWK score (among the machine learning systems) for each prompt is highlighted in bold.

| Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Human | 0.721 | 0.814 | 0.769 | 0.851 | 0.753 | 0.776 | 0.721 | 0.629 | 0.754 |
| [Phandi et al., 2015] | 0.761 | 0.606 | 0.621 | 0.742 | 0.784 | 0.775 | 0.730 | 0.617 | 0.705 |
| [Dong & Zhang, 2016] | - | - | - | - | - | - | - | - | 0.734 |
| [Dong et al., 2017] | 0.822 | 0.682 | 0.672 | 0.814 | 0.803 | 0.811 | 0.801 | 0.705 | 0.764 |
| [Tay et al., 2018] | 0.832 | 0.684 | **0.695** | 0.788 | 0.815 | 0.810 | 0.800 | 0.697 | 0.764 |
| HISK | 0.836 | 0.724 | 0.677 | 0.821 | 0.830 | 0.828 | 0.801 | 0.726 | 0.780 |
| BOSWE | 0.788 | 0.689 | 0.667 | 0.809 | 0.824 | 0.824 | 0.766 | 0.679 | 0.756 |
| HISK+BOSWE | **0.845** | **0.729** | 0.684 | **0.829** | **0.833** | **0.830** | **0.804** | **0.729** | **0.785** |

8.6.5 In-Domain Results

The results for the in-domain automatic essay scoring task are presented in Table 8.7. In our empirical study, we also include feature ablation results. We report the QWK measure on each prompt as well as the overall average. We first note that the histogram intersection string kernel alone reaches better overall performance (0.780) than all previous works [Dong & Zhang, 2016; Dong et al., 2017; Phandi et al., 2015; Tay et al., 2018]. Remarkably, the overall performance of the HISK is also higher than the inter-human agreement (0.754). Although the BOSWE model can be regarded as a shallow approach, its overall results are comparable to those of deep learning approaches [Dong & Zhang, 2016; Dong et al., 2017; Tay et al., 2018]. When we combine the two models (HISK and BOSWE), we obtain even better results. Indeed, the combination of
string kernels and word embeddings attains the best performance on 7 out of 8 prompts. The average QWK score of HISK and BOSWE (0.785) is more than 2% better than the average scores of the best-performing state-of-the-art approaches [Dong et al., 2017; Tay et al., 2018].

Table 8.8: Cross-domain automatic essay scoring results of our approach versus two state-of-the-art methods [Dong & Zhang, 2016; Phandi et al., 2015]. Results are reported in terms of the quadratic weighted kappa (QWK) measure, using the same evaluation procedure as [Dong & Zhang, 2016; Phandi et al., 2015]. The best QWK scores for each source→target domain pair are highlighted in bold.

| Source→Target | Method | n_t = 0 | n_t = 10 | n_t = 25 | n_t = 50 | n_t = 100 |
|---|---|---|---|---|---|---|
| 1→2 | [Phandi et al., 2015] | 0.434 | 0.463 | 0.457 | 0.492 | 0.510 |
| 1→2 | [Dong & Zhang, 2016] | - | 0.546 | 0.569 | 0.563 | 0.559 |
| 1→2 | HISK | 0.440 | **0.586** | **0.637** | 0.652 | 0.657 |
| 1→2 | BOSWE | 0.398 | 0.474 | 0.478 | 0.492 | 0.506 |
| 1→2 | HISK+BOSWE | **0.542** | 0.584 | 0.632 | **0.657** | **0.661** |
| 3→4 | [Phandi et al., 2015] | 0.522 | 0.593 | 0.609 | 0.618 | 0.646 |
| 3→4 | [Dong & Zhang, 2016] | - | 0.628 | 0.656 | 0.659 | 0.662 |
| 3→4 | HISK | **0.703** | **0.716** | 0.724 | 0.742 | 0.751 |
| 3→4 | BOSWE | 0.615 | 0.640 | 0.716 | 0.728 | 0.727 |
| 3→4 | HISK+BOSWE | 0.701 | 0.713 | **0.737** | **0.754** | **0.779** |
| 5→6 | [Phandi et al., 2015] | 0.187 | 0.539 | 0.662 | 0.680 | 0.713 |
| 5→6 | [Dong & Zhang, 2016] | - | 0.647 | 0.700 | 0.714 | 0.750 |
| 5→6 | HISK | 0.715 | 0.726 | 0.754 | 0.757 | 0.781 |
| 5→6 | BOSWE | 0.617 | 0.623 | 0.644 | 0.650 | 0.692 |
| 5→6 | HISK+BOSWE | **0.728** | **0.734** | **0.764** | **0.771** | **0.788** |
| 7→8 | [Phandi et al., 2015] | 0.171 | 0.586 | 0.607 | 0.613 | 0.621 |
| 7→8 | [Dong & Zhang, 2016] | - | 0.570 | 0.590 | 0.568 | 0.587 |
| 7→8 | HISK | 0.486 | 0.604 | 0.617 | 0.626 | 0.639 |
| 7→8 | BOSWE | 0.419 | 0.526 | 0.577 | 0.582 | 0.591 |
| 7→8 | HISK+BOSWE | **0.522** | **0.606** | **0.637** | **0.638** | **0.649** |

8.6.6 Cross-Domain Results

The results for the cross-domain automatic essay scoring task are presented in Table 8.8. For each and every source→target pair, we report better results than both state-of-the-art methods [Dong & Zhang, 2016; Phandi et al., 2015]. We observe that the differences between our best QWK scores and those of the other approaches are sometimes much higher in the cross-domain setting than in the in-domain setting. We particularly notice that the difference from [Phandi et al., 2015] when $n_t = 0$ is always higher than 10%. Our
highest improvement (more than 54%, from 0.187 to 0.728) over [Phandi et al., 2015] is recorded for the pair 5→6, when $n_t = 0$. Our score in this case (0.728) is even higher than both scores of [Dong & Zhang, 2016; Phandi et al., 2015] when they use $n_t = 50$. Unlike in the in-domain setting, we note that the combination of string kernels and word embeddings does not always provide better results than string kernels alone, particularly when the number of target samples ($n_t$) added into the training set is less than or equal to 25.

8.7 Discussion

In this chapter, we have presented an approach for building an effective feature representation for various text classification tasks. The proposed approach is based on clustering word embeddings using k-means and on representing a text document as a bag-of-super-word-embeddings, in a similar fashion to the bag-of-visual-words model, which is broadly used in computer vision for representing images. The empirical results on polarity classification and text categorization by topic demonstrate that our approach is able to surpass the classical bag-of-words approach. Moreover, the automatic essay scoring experiments indicate that BOSWE in combination with string kernels attains the best performance on the automatic essay scoring task. Using a shallow approach, we report better results compared to recent deep learning approaches [Dong & Zhang, 2016; Dong et al., 2017; Tay et al., 2018].

Some researchers [Martinet, 2014] have questioned the suitability of the k-means algorithm for the vector quantization of visual words, as the generated clusters (visual words) do not follow Zipf's law, although words in natural language do follow it. In future work, we aim to replace the k-means clustering approach with alternative approaches, such as density-based clustering or self-organizing maps. In case the words projected in the embedding space are not uniformly distributed, it would be more appropriate to employ a clustering algorithm that is able to
capture the distribution of the embedded words into a vocabulary that follows Zipf's law. According to Martinet [Martinet, 2014], this can lead to more accurate results.

Another direction that is worth looking into in future work is to include spatial information into the BOSWE model. Spatial information [Ionescu & Popescu, 2015a] has already been shown to improve the performance of the bag-of-visual-words model for object recognition in Chapter 3 and the performance of the bag-of-words model for text categorization by topic in Chapter 7. This seems to be a promising way to further improve the accuracy of the bag-of-super-word-embeddings.

References

Agirre, Eneko and Edmonds, Philip Glenny. Word Sense Disambiguation: Algorithms and Applications. Springer, 2006. (cited on 225)
Alikaniotis, Dimitrios, Yannakoudakis, Helen, and Rei, Marek. Automatic text scoring using neural networks. In Proceedings of ACL, pp. 715–725, 2016. (cited on 224)
Attali, Yigal and Burstein, Jill. Automated essay scoring with e-rater v.2.0. Journal of Technology, Learning, and Assessment, 4(3):1–30, 2006. (cited on 224)
Bay, Herbert, Ess, Andreas, Tuytelaars, Tinne, and Gool, Luc Van. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110(3):346–359, June 2008. (cited on 222, 223, 225)
Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. (cited on 222)
Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Janvin, Christian. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155, March 2003. (cited on 222, 225)
Bhingardive, Sudha, Singh, Dhirendra, V, Rudramurthy, Redkar, Hanumant Harichandra, and Bhattacharyya, Pushpak. Unsupervised Most Frequent Sense Detection using Word Embeddings. In Proceedings of NAACL, pp. 1238–1243. The Association for Computational Linguistics, 2015. (cited on 222, 226)
Butnaru, Andrei and Ionescu, Radu Tudor. From Image to Text Classification: A Novel
Approach based on Clustering Word Embeddings. In Proceedings of KES, pp. 1784–1793, 2017. (cited on 223, 227)
Butnaru, Andrei, Ionescu, Radu Tudor, and Hristea, Florentina. ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing. In Proceedings of EACL, pp. 916–926, 2017. (cited on 222, 226)
Chang, Chih-Chung and Lin, Chih-Jen. Training ν-Support Vector Regression: Theory and Algorithms. Neural Computation, 14:1959–1977, 2002. (cited on 231)
Chang, Chih-Chung and Lin, Chih-Jen. LibSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. (cited on 230, 239)
Chen, Hongbo and He, Ben. Automated essay scoring by maximizing human-machine agreement. In Proceedings of EMNLP, pp. 1741–1752, 2013. (cited on 224)
Chen, Xinxiong, Liu, Zhiyuan, and Sun, Maosong. A Unified Model for Word Sense Representation and Disambiguation. In Proceedings of EMNLP, pp. 1025–1035, Doha, Qatar, October 2014. Association for Computational Linguistics. (cited on 222, 226)
Clinchant, Stéphane and Perronnin, Florent. Aggregating continuous word embeddings for information retrieval. In Proceedings of CVSC Workshop, pp. 100–109, 2013. (cited on 222, 226)
Collobert, Ronan and Weston, Jason. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of ICML, pp. 160–167, New York, NY, USA, 2008. ACM. (cited on 222, 225)
Cortes, Corinna and Vapnik, Vladimir. Support-Vector Networks. Machine Learning, 20(3):273–297, 1995. (cited on 229)
Cozma, Madalina, Butnaru, Andrei, and Ionescu, Radu Tudor. Automated essay scoring with string kernels and word embeddings. In Proceedings of ACL, pp. 503–509, 2018. (cited on 224)
Csurka, Gabriella, Dance, Christopher R., Fan, Lixin, Willamowski, Jutta, and Bray, Cédric. Visual categorization with bags of keypoints. In Proceedings of Workshop on
Statistical Learning in Computer Vision at ECCV, pp. 1–22, 2004. (cited on 222, 226)
Dalal, Navneet and Triggs, Bill. Histograms of Oriented Gradients for Human Detection. In Proceedings of CVPR, volume 1, pp. 886–893, Washington, DC, USA, 2005. IEEE Computer Society. (cited on 222)
Dikli, Semire. An Overview of Automated Scoring of Essays. Journal of Technology, Learning, and Assessment, 5(1):1–35, 2006. (cited on 224)
Dong, Fei and Zhang, Yue. Automatic Features for Essay Scoring – An Empirical Study. In Proceedings of EMNLP, pp. 1072–1077, 2016. (cited on 224, 238, 239, 240, 241, 242, 305)
Dong, Fei, Zhang, Yue, and Yang, Jie. Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring. In Proceedings of CONLL, pp. 153–162, 2017. (cited on 224, 238, 239, 240, 242, 305)
Dos Santos, Cícero Nogueira and Gatti, Maira. Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In Proceedings of COLING, pp. 69–78, 2014. (cited on 222, 226)
Foltz, Peter W., Laham, Darrell, and Landauer, Thomas K. Automated essay scoring: Applications to educational technology. In Proceedings of EdMedia, pp. 40–64, 1999. (cited on 224)
Giménez-Pérez, Rosa M., Franco-Salvador, Marc, and Rosso, Paolo. Single and Cross-domain Polarity Classification using String Kernels. In Proceedings of EACL, pp. 558–563, April 2017. (cited on 224)
Iacobacci, Ignacio, Pilehvar, Mohammad Taher, and Navigli, Roberto. Embeddings for Word Sense Disambiguation: An Evaluation Study. In Proceedings of ACL, pp. 897–907, August 2016. (cited on 222, 226)
Ionescu, Radu Tudor and Butnaru, Andrei. Learning to Identify Arabic and German Dialects using Multiple Kernels. In Proceedings of VarDial Workshop of EACL, pp. 200–209, 2017. (cited on 224)
Ionescu, Radu Tudor and Popescu, Marius. Kernels for Visual Words Histograms. In Proceedings of ICIAP, volume 8156, pp. 81–90, Heidelberg, 2013. LNCS Springer-Verlag. (cited on 223, 229, 233)
Ionescu, Radu Tudor and Popescu, Marius. Have a SNAK. Encoding Spatial
Information with the Spatial Non-alignment Kernel. In Proceedings of ICIAP, volume 9279, pp. 97–108. Springer LNCS, 2015a. (cited on 242)
Ionescu, Radu Tudor and Popescu, Marius. PQ kernel: a rank correlation kernel for visual word histograms. Pattern Recognition Letters, 55:51–57, 2015b. (cited on 223, 229, 230, 233)
Ionescu, Radu Tudor and Popescu, Marius. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. In Proceedings of VarDial Workshop of COLING, pp. 135–144, 2016. (cited on 224)
Ionescu, Radu Tudor, Popescu, Marius, and Grozea, Cristian. Local Learning to Improve Bag of Visual Words Model for Facial Expression Recognition. In Proceedings of ICML Workshop on Challenges in Representation Learning, 2013. (cited on 225, 230)
Ionescu, Radu Tudor, Popescu, Andreea Lavinia, and Popescu, Marius. Texture Classification with the PQ Kernel. In Proceedings of WSCG, pp. 111–118, 2014a. (cited on 229)
Ionescu, Radu Tudor, Popescu, Marius, and Cahill, Aoife. Can characters reveal your native language?
A language-independent approach to native language identification. In Proceedings of EMNLP, pp. 1363–1373. Association for Computational Linguistics, October 2014b. (cited on 224, 231)
Ionescu, Radu Tudor, Popescu, Marius, and Cahill, Aoife. String kernels for native language identification: Insights from behind the curtains. Computational Linguistics, 42(3):491–525, 2016. (cited on 224)
Jaakkola, Tommi S., Haussler, David, et al. Exploiting generative models in discriminative classifiers. In Proceedings of NIPS, pp. 487–493. MIT, 1999. (cited on 226)
Joachims, Thorsten. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of ECML, pp. 137–142, London, UK, 1998. Springer-Verlag. (cited on 222, 234)
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS, pp. 1106–1114, 2012. (cited on 222)
Larkey, Leah S. Automatic essay grading using text categorization techniques. In Proceedings of SIGIR, pp. 90–95, 1998. (cited on 224)
Le, Quoc and Mikolov, Tomas. Distributed Representations of Sentences and Documents. In Jebara, Tony and Xing, Eric P. (eds.), Proceedings of ICML, pp. 1188–1196. JMLR Workshop and Conference Proceedings, 2014. (cited on 226)
Leung, Thomas and Malik, Jitendra. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. International Journal of Computer Vision, 43(1):29–44, June 2001. (cited on 225, 226, 227)
Lewis, David. The Reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578/, 1997. (cited on 234)
Lodhi, Huma, Saunders, Craig, Shawe-Taylor, John, Cristianini, Nello, and Watkins, Christopher J. C. H. Text Classification using String Kernels. Journal of Machine Learning Research, 2:419–444, 2002. (cited on 231)
Lowe, David G. Object Recognition from Local Scale-Invariant Features. In Proceedings of ICCV, volume 2, pp.
1150–1157, Washington, DC, USA, 1999. IEEE Computer Society. (cited on 222, 223, 225)
Lowe, David G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004. (cited on 222, 223, 225)
Manning, Christopher D. and Schütze, Hinrich. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999. (cited on 225)
Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. (cited on 222, 225)
Martinet, Jean. From text vocabularies to visual vocabularies: what basis? In Proceedings of VISAPP, volume 2, pp. 668–675, 2014. (cited on 227, 242)
Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Gregory S., and Dean, Jeffrey. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, pp. 3111–3119, 2013. (cited on 222, 225, 230)
Montavon, Grégoire, Orr, Geneviève B., and Müller, Klaus-Robert (eds.). Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science (LNCS). Springer, 2nd edition, 2012. (cited on 222)
Pang, Bo, Lee, Lillian, and Vaithyanathan, Shivakumar. Thumbs Up?
Sentiment Classification Using Machine Learning Techniques. In Proceedings of EMNLP, volume 10, pp. 79–86, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. (cited on 222, 231, 232, 233)
Perronnin, Florent, Sánchez, Jorge, and Mensink, Thomas. Improving the fisher kernel for large-scale image classification. In Proceedings of ECCV, pp. 143–156, Berlin, Heidelberg, 2010. Springer-Verlag. (cited on 226)
Phandi, Peter, Chai, Kian Ming A., and Ng, Hwee Tou. Flexible Domain Adaptation for Automated Essay Scoring Using Correlated Linear Regression. In Proceedings of EMNLP, pp. 431–439, 2015. (cited on 224, 238, 239, 240, 241, 305)
Philbin, James, Chum, Ondrej, Isard, Michael, Sivic, Josef, and Zisserman, Andrew. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of CVPR, pp. 1–8, 2007. (cited on 225, 226, 227)
Popescu, Marius and Grozea, Cristian. Kernel methods and string kernels for authorship analysis. In Forner, Pamela, Karlgren, Jussi, and Womser-Hacker, Christa (eds.), CLEF (Online Working Notes/Labs/Workshop), Rome, Italy, September 2012. (cited on 224)
Popescu, Marius, Grozea, Cristian, and Ionescu, Radu Tudor. HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages. In Proceedings of KES, pp. 1755–1763, 2017. (cited on 224)
Porter, Martin F. An algorithm for suffix stripping. Program, 14(3):130–137, 1980. (cited on 234)
Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. (cited on 229, 230, 231)
Sivic, Josef, Russell, Bryan C., Efros, Alexei A., Zisserman, Andrew, and Freeman, William T. Discovering Objects and their Localization in Images. In Proceedings of ICCV, pp. 370–377. IEEE Computer Society, 2005. (cited on 222, 225, 227)
Somasundaran, Swapna, Burstein, Jill, and Chodorow, Martin. Lexical Chaining for Measuring Discourse Coherence Quality in Test-taker Essays. In Proceedings of COLING, pp. 950–961, 2014. (cited
on 224)
Song, Wei, Wang, Dong, Fu, Ruiji, Liu, Lizhen, Liu, Ting, and Hu, Guoping. Discourse Mode Identification in Essays. In Proceedings of ACL, pp. 112–122, 2017. (cited on 224)
Taghipour, Kaveh and Ng, Hwee Tou. A neural approach to automated essay scoring. In Proceedings of EMNLP, pp. 1882–1891, 2016. (cited on 224)
Tay, Yi, Phan, Minh C., Tuan, Luu Anh, and Hui, Siu Cheung. SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring. In Proceedings of AAAI, pp. 1–8, 2018. (cited on 224, 238, 239, 240, 242, 305)
Turian, Joseph, Ratinov, Lev, and Bengio, Yoshua. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pp. 384–394. Association for Computational Linguistics, 2010. (cited on 226)
Vedaldi, Andrea and Fulkerson, B. VLFeat: An Open and Portable Library of Computer Vision Algorithms. http://www.vlfeat.org/, 2008. (cited on 230)
Vedaldi, Andrea and Zisserman, Andrew. Efficient additive kernels via explicit feature maps. In Proceedings of CVPR, pp. 3539–3546, San Francisco, CA, USA, 2010. IEEE Computer Society. (cited on 229, 230)
Wang, Jinhao and Brown, Michelle Stallone. Automated essay scoring versus human scoring: A correlational study. Contemporary Issues in Technology and Teacher Education, 8(4):310–325, 2008. (cited on 224)
Xue, Xiao-Bing and Zhou, Zhi-Hua. Distributional features for text categorization. IEEE Transactions on Knowledge and Data Engineering, 21(3):428–442, March 2009. (cited on 234, 236)
Yang, Yiming and Liu, Xin. A re-examination of text categorization methods. In Proceedings of SIGIR, pp. 42–49, New York, NY, USA, 1999. ACM. (cited on 234)
Yannakoudakis, Helen, Briscoe, Ted, and Medlock, Ben. A New Dataset and Method for Automatically Grading ESOL Texts. In Proceedings of ACL, pp. 180–189, 2014. (cited on 224)
Ye, Xin, Shen, Hui, Ma, Xiao, Bunescu, Razvan, and Liu, Chang. From word embeddings to document similarities for improved information retrieval in software
engineering. In Proceedings of ICSE, pp. 404–415, 2016. (cited on 222, 226)
Zhang, Jian, Marszalek, Marcin, Lazebnik, Svetlana, and Schmid, Cordelia. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision, 73(2):213–238, June 2007. (cited on 222, 225)

Chapter 9

Word Sense Disambiguation using ShotgunWSD

Abstract

ShotgunWSD is a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm is inspired by the Shotgun sequencing technique, which is a broadly-used whole genome sequencing approach. ShotgunWSD performs WSD at the document level based on three phases. The first phase consists of applying a brute-force WSD algorithm on short context windows selected from the document in order to generate a short list of likely sense configurations for each window. The second phase consists of assembling the local sense configurations into longer composite configurations by prefix and suffix matching. In the third phase, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a majority voting scheme that considers only the top configurations in which the respective word appears.

In this chapter, we present an improved version (2.0) of ShotgunWSD which is based on a different approach for computing the relatedness score between two word senses, a step that lies at the core of building better local sense configurations. For each sense, we collect all the words from the corresponding WordNet synset, gloss and related synsets, into a sense bag. We embed the collected words from all the sense bags in the entire document into a vector space using a common word embedding framework. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we consider that clusters with fewer samples (with respect to a given threshold) represent outliers and we eliminate these
clusters altogether. Words from the eliminated clusters are also removed from each and every sense bag. Finally, we compute the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense. We compare the improved ShotgunWSD algorithm (version 2.0) with its previous version (1.0) as well as several state-of-the-art unsupervised WSD algorithms. We demonstrate that ShotgunWSD 2.0 yields better performance on four data sets: SemEval 2007, Senseval-2, Senseval-3 and SemEval 2015. Furthermore, our algorithm outperforms the strong Most Common Sense (MCS) baseline on one data set, a remarkable achievement for an unsupervised technique.

9.1 Introduction

Word Sense Disambiguation (WSD) is a core problem studied in the Natural Language Processing (NLP) community. WSD refers to the task of identifying which sense of a word is used in a given context. It has the potential to improve many NLP applications such as machine translation [Carpuat & Wu, 2007], text summarization [Plaza et al., 2011], information retrieval [Chifu & Ionescu, 2012] or sentiment analysis [Sumanth & Inkpen, 2015]. Existing WSD algorithms [Agirre & Edmonds, 2006; Navigli, 2009] are usually divided into supervised, unsupervised, and knowledge-based techniques. Nonetheless, hybrid approaches, for instance unsupervised and knowledge-based, have also been proposed in the literature [Hristea et al., 2008]. Among these, supervised methods have reached the best disambiguation results [Iacobacci et al., 2016], but their main disadvantage is that they need large amounts of labeled examples for the supervised learning stage. Since large annotated corpora are difficult to obtain, many researchers have turned their focus to developing unsupervised learning approaches [Bhingardive et al., 2015; Chen et al., 2014; Schwab et al., 2012, 2013a,b].

In this chapter, we present an improved version of a novel WSD algorithm [Butnaru et al., 2017],
termed ShotgunWSD¹, that stems from the Shotgun genome sequencing technique [Anderson, 1981; Istrail et al., 2004]. Our WSD algorithm is also unsupervised, but it requires knowledge in the form of WordNet [Fellbaum, 1998; Miller, 1995] synsets as well as relations. More precisely, for each sense of a word, we build a disambiguation vocabulary (or sense bag) that is an unordered list of words collected from the corresponding WordNet synset, gloss and related synsets, which are chosen depending on the part-of-speech of the ambiguous word. In this sense, ShotgunWSD can be viewed as a hybrid (unsupervised and knowledge-based) approach.

In general, WSD algorithms can be divided into approaches that work at the local level and approaches that work at the global level. A local WSD approach, such as the extended Lesk measure [Banerjee & Pedersen, 2002, 2003; Lesk, 1986], is designed to assign the corresponding sense to a target word in a given context window of a few words. The corresponding sense is usually selected from an existing sense inventory. For instance, for the word "sense" in the context "You have a good sense of humor.", a local WSD algorithm should choose the sense that corresponds to the natural ability rather than the meaning of a word or the sensation. More generally, a global WSD algorithm aims to choose the appropriate sense for each ambiguous word in an entire text document. The obvious solution for global WSD is the exhaustive evaluation of all sense combinations (configurations) [Patwardhan et al., 2003], but the time complexity grows exponentially with the number of words in the document, as also noted in [Schwab et al., 2012, 2013a]. For example, in the sentence "You have a good sense of humor.", we have four ambiguous WordNet [Miller, 1995] entries (considering that the part-of-speech of each word is already known): "have" (with 19 senses as verb), "good" (with 21 senses as adjective), "sense" (with 5 senses as noun) and "humor" (with 6 senses
as noun). Consequently, there are $19 \times 21 \times 5 \times 6 = 11970$ possible sense configurations. However, if we extend the sentence to "I think that you have a good sense of humor.", we have two more ambiguous words: "I" (with 3 senses as noun) and "think" (with 13 senses as verb). In this case, the number of possible configurations grows to $3 \times 13 \times 19 \times 21 \times 5 \times 6 = 466830$. This example reveals that the brute-force (BF) solution quickly becomes impractical for windows of more than a few words. Therefore, several approximation methods [Schwab et al., 2012, 2013a] have been proposed for the global WSD task in order to overcome the exponential growth of the search space.

¹ The open source implementation of ShotgunWSD is provided for free at https://github.com/butnaruandrei/ShotgunWSD.

ShotgunWSD is conceived to perform global WSD by combining multiple local sense configurations that are obtained using BF search, thus avoiding BF search on the whole text document. It employs a local WSD algorithm to build the local sense configurations. Butnaru et al. [Butnaru et al., 2017] alternatively used two methods for this step, namely the extended Lesk measure [Banerjee & Pedersen, 2002, 2003] and an approach based on deriving sense embeddings from word embeddings [Bengio et al., 2003; Collobert & Weston, 2008; Mikolov et al., 2013]. In this chapter, we describe a third approach which starts by embedding the words collected from all the sense bags in the entire document into a vector space using a common word embedding framework [Mikolov et al., 2013]. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we consider that clusters with fewer samples (with respect to a given threshold) belong to outlier (unlikely) senses and we eliminate these clusters from the subsequent steps. Words from the eliminated clusters are also removed from each and every sense bag. Finally, we compute the median of all the remaining word embeddings in a given sense bag to
obtain a sense embedding for the corresponding word sense. Hence, the derived sense embedding will not take into account outlier words and will reduce the chance of selecting unlikely word senses. The main difference between ShotgunWSD 2.0 and its previous version presented in [Butnaru et al., 2017] consists of applying k-means clustering and eliminating smaller (outlier) clusters of semantically related words.

Our global WSD algorithm comprises three main phases. In the first phase, context windows of fixed length are selected from the document, and for each context window the top scoring sense configurations constructed by BF search are kept for the upcoming phase. The second phase consists of merging the retained sense configurations based on prefix and suffix matching. Finally, the third phase consists of ranking the configurations obtained thus far by their length (the longer, the better), and choosing the sense of each word through a majority vote on a short list of top configurations that cover the respective word.

We have conducted experiments on the SemEval 2007 [Navigli et al., 2007], Senseval-2 [Edmonds & Cotton, 2001], Senseval-3 [Mihalcea et al., 2004] and SemEval 2015 [Moro & Navigli, 2015] data sets in order to compare ShotgunWSD 2.0 with its previous version [Butnaru et al., 2017] and other state-of-the-art unsupervised and knowledge-based approaches [Bhingardive et al., 2015; Chen et al., 2014; Manion, 2015; Schwab et al., 2013a], as well as the Most Common Sense (MCS) baseline¹. MCS is considered one of the strongest baselines in WSD [Agirre & Edmonds, 2006]. The empirical results show that our algorithm compares favorably to these state-of-the-art approaches on all four data sets. Furthermore, our algorithm outperforms the MCS baseline on one data set, which is a remarkable achievement for an unsupervised technique.

We organize the rest of this chapter as follows. We present related work on unsupervised and knowledge-based WSD
algorithms in Section 9.2. We describe the ShotgunWSD algorithm in Section 9.3. We present the experiments in Section 9.4. Finally, we draw our conclusions in Section 9.5.

¹ Also known as the Most Frequent Sense baseline.

9.2 Related Work

Researchers have proposed a wide range of methods to perform WSD [Agirre & Edmonds, 2006; Navigli, 2009; Vidhu Bhala & Abirami, 2014]. The most accurate techniques are supervised [Iacobacci et al., 2016], but they require annotated training corpora which are very difficult to obtain. In order to overcome this limitation, some researchers have proposed alternative WSD methods based on unsupervised learning or knowledge bases [Agirre et al., 2014; Banerjee & Pedersen, 2002, 2003; Bhingardive et al., 2015; Chen et al., 2014; Nguyen & Ock, 2013; Panchenko et al., 2017; Schwab et al., 2012, 2013a,b]. Since our algorithm is unsupervised and based on the WordNet [Fellbaum, 1998; Miller, 1995] knowledge base, our main focus is to present related work in the same area.

Banerjee and Pedersen [Banerjee & Pedersen, 2002] extend the gloss overlap algorithm of Lesk [Lesk, 1986] by using WordNet relations. Patwardhan et al. [Patwardhan et al., 2003] propose a brute-force algorithm for global WSD by employing the extended Lesk measure [Banerjee & Pedersen, 2002, 2003] to compute the semantic relatedness among senses in a given text. As discussed in Section 9.1, their BF approach is not suitable for whole text documents, because of the high computational time. More recently, Schwab et al. [Schwab et al., 2012] propose and compare three stochastic algorithms for global WSD as alternatives to BF search, namely a Genetic Algorithm, Simulated Annealing, and Ant Colony Optimization. Among these algorithms, the authors have found that the Ant Colony Algorithm [Schwab et al., 2012, 2013a] yields better results than the other two. Panchenko et al. [Panchenko et al., 2017] propose an unsupervised and knowledge-free word sense induction and disambiguation approach that
relies on induced inventories as a pivot for learning sense feature representations.

Recently, word embeddings have been employed in various works [Bhingardive et al., 2015; Chen et al., 2014; Iacobacci et al., 2016] to improve WSD results. Word embeddings have long been known in the NLP community [Bengio et al., 2003; Collobert & Weston, 2008], but they have recently become more popular due to the work of Mikolov et al. [Mikolov et al., 2013], which introduced the word2vec framework for efficiently building vector representations of words. Chen et al. [Chen et al., 2014] present a unified model for joint word sense representation and disambiguation. They use the Skip-gram model to learn word vectors. On the other hand, Bhingardive et al. [Bhingardive et al., 2015] use pre-trained word vectors to build sense embeddings by averaging the word vectors produced for each sense of a target word. As their goal is to find an approximation for the MCS baseline, they consider the sense embedding that is closest to the embedding vector of the target word. Iacobacci et al. [Iacobacci et al., 2016] propose different methods through which word embeddings can be leveraged in a supervised WSD system architecture. Interestingly, Iacobacci et al. [Iacobacci et al., 2016] find that a WSD method based on word embeddings alone can provide significant performance improvements over a state-of-the-art WSD system that uses standard features for the WSD task.

9.3 Method

As illustrated in Section 9.1 and also noted by Schwab et al. [Schwab et al., 2012], brute-force WSD algorithms based on semantic relatedness [Patwardhan et al., 2003] are not practical for whole text disambiguation due to their exponential time complexity. In this section, we describe a WSD algorithm that aims to avoid this computational issue. Our algorithm is inspired by the Shotgun genome sequencing technique [Anderson, 1981], which is used in genetics research to obtain long DNA strands, a task known as whole genome sequencing. For
instance, Istrail et al. [Istrail et al., 2004] have used this technique to assemble the human genome. Before a long DNA strand can be read, Shotgun sequencing needs to create multiple copies of the respective strand. Next, DNA is randomly broken down into many small segments called reads (usually between 30 and 400 nucleotides long), by adding a restriction enzyme into the chemical solution containing the DNA. The reads can then be sequenced using Next-Generation Sequencing technology [Voelkerding et al., 2009], for example by using an Illumina (Solexa) machine [Bennett, 2004]. In genome assembly, the low quality reads are usually eliminated [Patel & Jain, 2012] and the whole genome is reconstructed by assembling the remaining reads. One strategy is to merge two or more reads in order to obtain longer DNA segments, if they have a significant overlap of matching nucleotides. Because of reading errors or mutations, the overlap is usually measured using a distance measure, for example the edit distance [Levenshtein, 1966]. If a backbone DNA sequence is available, the reads are aligned to the backbone DNA before assembly, in order to find their approximate position in the DNA that needs to be reconstructed.

We next present how we adapt the Shotgun sequencing technique for the task of global WSD. We will make a few observations along the way that will lead to a simplified method, namely ShotgunWSD, which is formally presented in Algorithm 8. The three main phases of ShotgunWSD are also illustrated in Figure 9.1. We use the following notations in Algorithm 8. An array (or an ordered set of elements) is denoted by $X = (x_1, x_2, \ldots, x_m)$ and the length of $X$ is denoted by $|X| = m$. Arrays are considered to be indexed starting from position 1, thus $X[i] = x_i, \forall i \in \{1, 2, \ldots, m\}$.

Our goal is to find a configuration of senses $G$ for the whole document $D$ that matches the ground-truth configuration produced by human annotators. A configuration of senses is simply obtained by assigning a sense to each word
in the text document $D$. In this work, the senses are selected from WordNet [Fellbaum, 1998; Miller, 1995], according to steps 7-8 of Algorithm 8.

Algorithm 8: ShotgunWSD Algorithm
 1  Input:
 2    $D = (w_1, w_2, \ldots, w_m)$ – a document of $m$ words denoted by $w_i$, $i \in \{1, 2, \ldots, m\}$;
 3    $n$ – the length of the context windows ($1 < n < m$);
 4    $k$ – the number of sense configurations considered for the voting scheme ($k > 0$);
 5  Initialization:
 6    $c \leftarrow 20$;
 7    for $i \in \{1, 2, \ldots, m\}$ do
 8      $S_{w_i} \leftarrow$ the set of WordNet synsets of $w_i$;
 9    $\mathcal{S} \leftarrow \emptyset$;
10    $G \leftarrow (0, 0, \ldots, 0)$, such that $|G| = m$;
11  Computation:
12    for $i \in \{1, 2, \ldots, m - n + 1\}$ do
13      $\mathcal{C}_i \leftarrow \emptyset$;
14      while did not generate all sense configurations do
15        $C \leftarrow$ a new configuration $(s_{w_i}, s_{w_{i+1}}, \ldots, s_{w_{i+n-1}})$, $s_{w_j} \in S_{w_j}$, $\forall j \in \{i, \ldots, i + n - 1\}$, such that $C \notin \mathcal{C}_i$;
16        $r \leftarrow 0$;
17        for $p \in \{1, 2, \ldots, n - 1\}$ do
18          for $q \in \{p + 1, p + 2, \ldots, n\}$ do
19            $r \leftarrow r + 0.1^{\frac{|p - q| - 1}{n - 1}} \cdot relatedness(C[p], C[q])$;
20        $\mathcal{C}_i \leftarrow \mathcal{C}_i \cup \{(C, i, n, r)\}$;
21      $\mathcal{C}_i \leftarrow$ the top $c$ configurations obtained by sorting the configurations in $\mathcal{C}_i$ by their relatedness score (descending);
22      $\mathcal{S} \leftarrow \mathcal{S} \cup \mathcal{C}_i$;
23    for $l \in \{\min\{5, n - 1\}, \ldots, 1\}$ do
24      for $p \in \{1, 2, \ldots, |\mathcal{S}|\}$ do
25        $(C_p, i_p, n_p, r_p) \leftarrow$ the $p$-th component of $\mathcal{S}$;
26        for $q \in \{1, 2, \ldots, |\mathcal{S}|\}$ do
27          $(C_q, i_q, n_q, r_q) \leftarrow$ the $q$-th component of $\mathcal{S}$;
28          if $i_q - i_p < n_p$ and $i_p \neq i_q$ then
29            $t \leftarrow true$;
30            for $x \in \{1, \ldots, l\}$ do
31              if $C_p[n_p - l + x] \neq C_q[x]$ then
32                $t \leftarrow false$;
33            if $t = true$ then
34              $C_{pq} \leftarrow (C_p[1], C_p[2], \ldots, C_p[n_p], C_q[l + 1], C_q[l + 2], \ldots, C_q[n_q])$;
35              $r_{pq} \leftarrow r_p$;
36              for $i \in \{1, 2, \ldots, n_p + n_q - l\}$ do
37                for $j \in \{l + 1, l + 2, \ldots, n_q\}$ do
38                  $r_{pq} \leftarrow r_{pq} + 0.1^{\frac{|i - j| - 1}{n - 1}} \cdot relatedness(C_{pq}[i], C_q[j])$;
39              $\mathcal{S} \leftarrow \mathcal{S} \cup \{(C_{pq}, i_p, n_p + n_q - l, r_{pq})\}$;
40    for $j \in \{1, 2, \ldots, m\}$ do
41      $\mathcal{Q}_j \leftarrow \{(C, i, d, r) \mid (C, i, d, r) \in \mathcal{S}, j \in \{i, i + 1, \ldots, i + d - 1\}\}$;
42      $\mathcal{Q}_j \leftarrow$ the top $k$ configurations obtained by sorting the configurations in $\mathcal{Q}_j$ by their length (descending);
43      $p_{w_j} \leftarrow$ the predominant sense obtained by using a majority voting scheme on $\mathcal{Q}_j$;
44      $G[j] \leftarrow p_{w_j}$;
45  Output:
46    $G = (p_{w_1}, p_{w_2}, \ldots, p_{w_m})$, $p_{w_i} \in S_{w_i}$, $\forall i \in \{1, 2, \ldots, m\}$ – the global configuration of senses returned by the algorithm.

Figure 9.1: An example of building a global sense configuration with ShotgunWSD for a document of 7 words. The algorithm is based on three main phases: building local sense configurations using a brute-force approach, assembling shorter configurations into longer configurations by prefix-suffix matching and majority voting.

Naturally, we will consider that the sense configuration of the whole document corresponds to the long DNA strand (whole genome) that needs to be sequenced. In this context, sense configurations of short context windows (less than 10 words) will correspond to the short DNA reads. A crucial difference here is that we know the location of the context windows in the whole document from the very beginning, so our task will be much easier compared to Shotgun sequencing (we do not need to use a backbone solution for the alignment of short sense configurations). At every possible location in the text document $D$, we select a window of $n$ words according to step 12 of Algorithm 8. The window length $n$ is an external parameter of our algorithm that can be tuned for an optimal trade-off between accuracy and speed. For each context window, we compute all possible sense configurations, according to steps 14-15 of Algorithm 8. A score is assigned to each sense configuration by computing the semantic relatedness between word senses (steps 16-19), as described by Patwardhan et al. [Patwardhan et al., 2003]. Butnaru et al. [Butnaru et al., 2017] alternatively employed two measures to compute the semantic relatedness: the extended Lesk measure [Banerjee & Pedersen, 2002, 2003] and a simple approach based on deriving sense embeddings from word embeddings [Mikolov et al., 2013]. In this chapter, we present a third approach that is based on clustering word vectors with k-means and on eliminating the smaller clusters (which contain outlier words). For the sake of completeness, all three methods are described in Section 9.3.1. In the new version of ShotgunWSD, we modify step 19 in order to weight the relatedness score by the distance between the two ambiguous words, as in [Iacobacci et al.,
2016]. The reason for weighting the score is that if two words are farther apart from each other, their relatedness score should have a smaller contribution to the total score of the local sense configuration. For the assembly phase (steps 23-39), we keep the top scoring sense configurations, according to step 21 of Algorithm 8. In step 21, we use an internal parameter $c$ in order to determine exactly how many sense configurations are retained per context window. Another important remark is that we assume that the brute-force (BF) algorithm used for generating sense configurations for short windows does not produce output errors (unlike the reading errors that occur in genome sequencing), so it is not necessary to use a distance measure in order to find overlaps for merging configurations. We simply check if the suffix of a former configuration coincides (matches exactly) with the prefix of a latter configuration in order to join them together, according to steps 29-33 of Algorithm 8. The length $l$ of the suffix and the prefix that get overlapped needs to be strictly greater than zero, so at least one sense choice needs to coincide. We gradually consider shorter and shorter suffix and prefix lengths starting with $l = \min\{5, n - 1\}$, according to step 23 of Algorithm 8. Sense configurations are assembled in order to obtain longer configurations (step 34), until none of the resulting configurations can be further merged together. When merging, the relatedness score of the resulting configuration needs to be recomputed (steps 36-38), but we can take advantage of some of the previously computed scores (step 35). Longer configurations indicate that there is an agreement (regarding the chosen senses) that spans across a longer piece of text. In other words, longer configurations are more likely to provide correct sense choices, since they inherently embed a higher degree of agreement among senses. After the configuration assembly phase, we start assigning the sense to each word in the document, according to step 40 of Algorithm 8. According to step 42,
we build a ranked list of sense configurations for each word in the document, based on the assumption that longer configurations provide better information about correct word senses. Naturally, for a given word, we only consider the configurations that contain the respective word, according to step 41. Finally, the sense of each word is given by a majority vote on the top $k$ configurations from its ranked list, according to steps 43-44 of Algorithm 8. The number of sense configurations $k$ is an external parameter of our approach, and it can be tuned for optimal results.

9.3.1 Semantic Relatedness

For a sense configuration assigned to a context window of $n$ words, we compute a semantic relatedness score (numeric value) between each pair of selected senses. In steps 19 and 38 of Algorithm 8, the score is computed by the relatedness function, which takes two word senses as input and provides their semantic relatedness score as output. Butnaru et al. [Butnaru et al., 2017] used two different approaches for computing the relatedness score. In this chapter, we present a third approach. We note that all three approaches are built on top of WordNet semantic relations. Each of the three approaches can be regarded as a different way of estimating the semantic relatedness of two WordNet synsets. For each synset, we first build a disambiguation vocabulary (also referred to as sense bag) by extracting words from the WordNet [Fellbaum, 1998; Miller, 1995] lexical knowledge base, as described next. Starting from the synset itself, we first include all the synonyms that form the respective synset along with the content words of the gloss (examples included). We also include into the disambiguation vocabulary words indicated by specific WordNet semantic relations that are chosen according to the part-of-speech of the ambiguous word. More precisely, we have considered hyponyms and meronyms for nouns. For adjectives, we have considered similar synsets, antonyms, attributes, pertainyms and related (see also)
synsets. For verbs, we have considered troponyms, hypernyms, entailments and outcomes. Finally, for adverbs, we have considered antonyms, pertainyms and topics. These choices have been made because previous studies [Banerjee & Pedersen, 2003; Hristea et al., 2008] have reached the conclusion that using these specific relations for each part-of-speech provides better empirical results for the WSD task. The disambiguation vocabulary generated by the WordNet feature selection described so far requires further processing in order to obtain our final vocabulary. This processing produces a closed and uniform feature set. The first processing step is to eliminate the stopwords. The remaining words are stemmed using the Porter stemmer algorithm [Porter, 1980]. The stemming process reduces a word to its root form by removing the most common morphological and inflexional endings from words in English. The resulting stems represent the final set of features that we use for computing the relatedness score between two synsets. For the sake of completeness, we next describe the two approaches for computing the relatedness score proposed in [Butnaru et al., 2017], as well as our novel approach based on removing outlier words using k-means clustering.

9.3.1.1 Extended Lesk Measure

The original Lesk algorithm [Lesk, 1986] takes into account one-word overlaps among the glosses of a target word and those that surround it in a given context. Banerjee et al. [Banerjee & Pedersen, 2002] consider that this is a significant limitation of the original Lesk algorithm, since dictionary glosses tend to be fairly short and they fail to provide sufficient information to make fine-grained distinctions between word senses. To this end, Banerjee et al. [Banerjee & Pedersen, 2003] extend the original Lesk algorithm with a measure that takes two WordNet synsets as input and returns a numeric value that quantifies their degree of semantic relatedness by taking into consideration the glosses of related WordNet
synsets as well. Moreover, when comparing two glosses, the extended Lesk measure considers overlaps of multiple consecutive words, based on the assumption that a longer phrase is more representative for the relatedness of the two synsets. Given two input glosses, the longest overlap between them is detected and then replaced with a unique marker (symbol) in each of the two glosses. The resulting glosses are then again checked for overlaps, and this process continues until there are no more overlaps. The lengths of the detected overlaps are squared and added together to obtain the score for the given pair of glosses. Depending on the WordNet relations used for each part-of-speech, several pairs of glosses are compared and their scores are summed up to obtain the final relatedness score. However, WordNet does not define semantic relations between synsets if they do not belong to the same part-of-speech. For this reason, we compute the semantic relatedness using only the WordNet glosses and examples when two words are of different parts-of-speech. Further details regarding the extended Lesk measure are provided by Banerjee et al. [Banerjee & Pedersen, 2003].

9.3.1.2 Sense Embeddings

In this section, we describe a simple approach based on word embeddings to measure the semantic relatedness of two synsets. Approaches based on word embeddings [Bengio et al., 2003; Collobert & Weston, 2008; Mikolov et al., 2013] represent words as low-dimensional real-valued vectors, such that semantically related words reside in close vicinity in the generated space. In our algorithm, we employ the pre-trained word embeddings computed by the word2vec framework [Mikolov et al., 2013] on the Google News data set using the Skip-gram model. This pre-trained model contains 300-dimensional vectors for nearly 3 million words and phrases. We compute the relatedness score between two synsets as described next. We embed each word in the disambiguation vocabulary of a synset in order to obtain the corresponding word
vector. We thus obtain a set of word embedding vectors for each given synset. We derive the sense embedding for a synset simply by computing the median of all the word embeddings in the corresponding set. We can naturally assume that some of the word vectors in the set correspond to words that do not help the disambiguation process. From this point of view, these words can be regarded as outliers. In this context, we consider that using the (geometric) median instead of the mean is more appropriate, as the mean is largely influenced by outliers. It is important to note that our third approach (presented next) for computing the semantic relatedness aims to properly address the outlier removal issue. According to our second approach, the semantic relatedness of two synsets is simply given by the cosine similarity between their median vectors:

$$relatedness(A, B) = \frac{\sum_{i=1}^{m} a_i b_i}{\sqrt{\sum_{i=1}^{m} a_i^2} \cdot \sqrt{\sum_{i=1}^{m} b_i^2}},$$

where $A$ and $B$ are $m$-dimensional median vectors, with components $a_i$ and $b_i$, corresponding to two WordNet synsets. For the employed word2vec model, the vectors have $m = 300$ components. An important remark is that Bhingardive et al. [Bhingardive et al., 2015] proposed an approach based on the mean (instead of the median) of word vectors to construct sense embeddings, but with a slightly different purpose than ours, namely to determine which synset is most similar to the target word, assuming that the respective synset should correspond to the most common sense of the respective word. As such, they completely disregard the context of the target word. Different from their approach, we are trying to measure the semantic relatedness between two synsets of distinct words that appear in the same context window. Furthermore, the empirical results presented in Section 9.4 show that our approach yields better performance than the MCS estimation approach of Bhingardive et al. [Bhingardive et al., 2015], thus putting an even greater gap between the two methods.

9.3.1.3 Sense Embeddings after Outlier Removal

We start by gathering the
words in the document that we aim to disambiguate into a set. Along with the words in the document, we also add all the words from the disambiguation vocabularies of each sense of each ambiguous word in the document. Each word is then embedded into a vector space using the word2vec [Mikolov et al., 2013] framework. Based on the fact that word embeddings carry semantic information by projecting semantically related words in the same region of the embedding space, the next step is to cluster the word vectors in order to obtain relevant semantic clusters of words. The words are clustered using k-means clustering with k-means++ [Arthur & Vassilvitskii, 2007] initialization. Next, we eliminate the clusters with fewer samples, based on the assumption that these smaller clusters contain mostly outlier samples.

Figure 9.2: A set of 400 data points sampled from two normal distributions of different means. The points are clustered into 30 clusters using k-means. The centroids of clusters with fewer than 10 samples are represented with a large blue square.

Figure 9.3: A histogram representing the number of data points in each cluster. The histogram corresponds to the k-means clustering applied over the 400 data points illustrated in Figure 9.2. A threshold of 10 is used to detect clusters of outliers.

We motivate our assumption through the following toy example. We generate 400 data points sampled from two normal distributions of different means. We group the points into $k = 30$ clusters using k-means and we illustrate the result in Figure 9.2. We then count the number of points in each cluster and obtain the histogram depicted in Figure 9.3. In this example, we consider that the clusters with fewer than 10 data points contain mostly outliers. The centroids of these smaller clusters are marked with a large blue square in Figure 9.2. We can clearly see that the marked clusters are farthest from both normal distribution means, indicating that the points they contain are indeed outliers. This
assumption is also supported by the results in abnormal event detection in video presented in Chapter 5 and in [Ionescu et al., 2018], since the same approach (based on k-means) is employed to remove clusters of outlier motion samples. Nevertheless, our aim is to test this assumption by quantifying the performance improvement of ShotgunWSD 2.0 over its previous version. To this end, we remove the words that belong to the eliminated clusters from each and every sense bag. We next compute the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense. The semantic relatedness of two synsets is given by the cosine similarity between the corresponding medians.

9.4 Experiments and Results

9.4.1 Data Sets

We evaluate the first and second version of our global WSD algorithm on four data sets. We compare both versions of ShotgunWSD with several state-of-the-art unsupervised and knowledge-based WSD methods, as long as the works presenting these methods [Apidianaki & Gong, 2015; Bhingardive et al., 2015; Chen et al., 2014; Manion, 2015; Schwab et al., 2013a; Torres & Gelbukh, 2009] report results on the data sets considered in this chapter.

Table 9.1: A summary of the number of ambiguous words along with the distribution of ambiguous words per part-of-speech in the four data sets considered in our evaluation.

Data Set        Words   Nouns   Adjectives   Verbs   Adverbs
SemEval 2007    2269    1108    362          591     208
Senseval-2      2473    1136    457          581     299
Senseval-3      2081    951     364          751     15
SemEval 2015    1175    671     166          254     84

We first compare ShotgunWSD with the MCS baseline and two state-of-the-art approaches [Chen et al., 2014; Schwab et al., 2013a], on the SemEval 2007 coarse-grained English all-words task [Navigli et al., 2007]. The SemEval 2007 coarse-grained English all-words data set¹ consists of 5 documents that contain 2269 ambiguous words altogether. While Schwab et al. [Schwab et al., 2013a] and Chen et al. [Chen et al., 2014] report results on SemEval 2007,
Bhingardive et al. [Bhingardive et al., 2015] and Torres et al. [Torres & Gelbukh, 2009] report results on Senseval-2 and Senseval-3. Hence, we also compare our approach with the MCS baseline, the MCS estimation method of Bhingardive et al. [Bhingardive et al., 2015] and the extended Lesk algorithm [Torres & Gelbukh, 2009] on the Senseval-2 English all-words [Edmonds & Cotton, 2001] and the Senseval-3 English all-words [Mihalcea et al., 2004] data sets. The Senseval-2 data set¹ consists of 3 documents that contain 2473 ambiguous words, while the Senseval-3 data set¹ consists of 3 documents that contain 2081 ambiguous words. Finally, we compare our approach with the winner [Apidianaki & Gong, 2015] and the first runner up [Manion, 2015] in the SemEval 2015 English all-words WSD task [Moro & Navigli, 2015]. The SemEval 2015 data set [Moro & Navigli, 2015] is composed of 4 documents that contain 1175 ambiguous words. A summary of the distribution of ambiguous words per part-of-speech in each data set is presented in Table 9.1.

¹ http://nlp.cs.swarthmore.edu/semeval/tasks/index.php
¹ http://web.eecs.umich.edu/~mihalcea/downloads.html

[Plot: $F_1$ score versus number of clusters (50, 100, 250, 500, 750).]
Figure 9.4: The $F_1$ scores of ShotgunWSD 2.0 on the first document of SemEval 2007, using different numbers of clusters for k-means.

9.4.2 Parameter Tuning

For ShotgunWSD 1.0, we use the same parameters as in [Butnaru et al., 2017]. The internal parameter $c$ is set to 20, since this value gives a reasonable amount of configuration choices for the subsequent steps, without using too much space and time. The length of the context windows is set to $n = 8$, as ShotgunWSD 1.0 runs in a reasonable amount of time with this length, namely 187 seconds on the first document of SemEval 2007. The reported time is measured on a computer with an Intel Core i7 3.4 GHz processor and 16 GB of RAM, using a single core. The final sense for each word is assigned using a majority vote based on the top $k = 15$
configurations. For ShotgunWSD 2.0, we use the same values for the parameters $c$ and $k$. However, the k-means clustering step requires additional processing time, so we need to reduce the window length to $n = 6$ in order to reach a processing time of the same order of magnitude as that of ShotgunWSD 1.0. On the same machine, ShotgunWSD 2.0 with $n = 6$ runs in 105 seconds on the first document of SemEval 2007, which is faster than ShotgunWSD 1.0 with $n = 8$. Butnaru et al. [Butnaru et al., 2017] showed that the processing time grows exponentially with the window length, hence $n = 8$ would not be a reasonable choice for ShotgunWSD 2.0.

There are two additional parameters for ShotgunWSD 2.0: the number of k-means clusters and the threshold used for outlier cluster elimination. As in [Butnaru et al., 2017; Schwab et al., 2013a], we tune our parameters on the first document of SemEval 2007. By tuning the parameters on just one document from SemEval 2007, we avoid overfitting to a particular data set. We tested our algorithm using 50, 100, 250, 500 and 750 clusters and we found that the performance starts to drop slightly after 250 clusters, as illustrated in Figure 9.4. In the end, we opted to use 250 clusters in all the experiments. We also tried to find out if eliminating 10%, 25% or 50% of the smaller (outlier) clusters would have a different effect on performance. As shown in Figure 9.5, it seems that eliminating 25% of the clusters is the optimal choice. It is important to note that we use the same parameters throughout the subsequent experiments on all four data sets.

[Plot: $F_1$ score versus small cluster elimination threshold.]
Figure 9.5: The $F_1$ scores of ShotgunWSD 2.0 on the first document of SemEval 2007, using different thresholds for eliminating the smaller k-means clusters.

9.4.3 Results on SemEval 2007

First, we conduct an empirical study on the SemEval 2007 coarse-grained English all-words task in order to evaluate the performance of
ShotgunWSD 1.0 and 2.0 as well as other WSD methods. As described in Section 9.3.1, we use two alternative approaches for computing the semantic relatedness scores in ShotgunWSD 1.0, namely extended Lesk and sense embeddings. ShotgunWSD 2.0 is based on a different approach which eliminates smaller clusters of word embeddings. We compare our two versions of ShotgunWSD with several bio-inspired algorithms described in [Schwab et al., 2012, 2013a], namely a Genetic Algorithm, Simulated Annealing, and Ant Colony Optimization. Along with the bio-inspired algorithms, we include an approach based on sense embeddings [Chen et al., 2014] in our comparative study. All the approaches considered in the evaluation are unsupervised. We compare them with the MCS baseline that is based on human annotations. The $F_1$ scores of the enumerated methods are all presented in Table 9.2.

Table 9.2: The $F_1$ scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the $F_1$ scores of various unsupervised state-of-the-art WSD approaches, on the SemEval 2007 coarse-grained English all-words task. The results reported for ShotgunWSD 1.0 are obtained for windows of $n = 8$ words and a majority vote on the top $k = 15$ configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of $n = 6$ words, a majority vote on the top $k = 15$ configurations and k-means clustering with 250 clusters.

Method                                                       F1 Score
Most Common Sense                                            78.89%
Genetic Algorithms [Schwab et al., 2013a]                    74.53%
Simulated Annealing [Schwab et al., 2013a]                   75.18%
Ant Colony [Schwab et al., 2013a]                            79.03%
S2C Unsupervised [Chen et al., 2014]                         75.80%
ShotgunWSD 1.0 + Extended Lesk [Butnaru et al., 2017]        79.15%
ShotgunWSD 1.0 + Sense Embeddings [Butnaru et al., 2017]     79.68%
ShotgunWSD 2.0                                               81.22%

It seems that the Ant Colony Optimization algorithm based on the weighted voting scheme described in [Schwab et al., 2013a] is the only method, among all state-of-the-art methods, that is capable of surpassing the MCS baseline. The unsupervised S2C approach
yields roughly 3% lower results than the MCS baseline, but Chen et al. [Chen et al., 2014] report better results in a semi-supervised setting. All versions of ShotgunWSD attain better results than the MCS baseline (78.89%) and the best state-of-the-art method, namely the Ant Colony Optimization algorithm (79.03%). Indeed, ShotgunWSD 1.0 obtains an $F_1$ score of 79.15% when using the extended Lesk measure and an $F_1$ score of 79.68% when using sense embeddings. Interestingly, we observe that ShotgunWSD 1.0 gives slightly better results when sense embeddings are used instead of the extended Lesk method. Nevertheless, ShotgunWSD 2.0 obtains even better results. It surpasses ShotgunWSD 1.0 by 1.54% and the second best approach (Ant Colony Optimization) by 2.19%. To our knowledge, the $F_1$ score of ShotgunWSD 2.0 (81.22%) is the best among all unsupervised methods evaluated on the SemEval 2007 coarse-grained English all-words task.

9.4.4 Results on Senseval-2

Table 9.3: The $F_1$ scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the $F_1$ scores of an unsupervised WSD approach and the extended Lesk measure, on the Senseval-2 English all-words data set. The results reported for ShotgunWSD 1.0 are obtained for windows of $n = 8$ words and a majority vote on the top $k = 15$ configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of $n = 6$ words, a majority vote on the top $k = 15$ configurations and k-means clustering with 250 clusters.

Method                                                       F1 Score
Most Common Sense                                            60.10%
MCS Estimation [Bhingardive et al., 2015]                    52.34%
Extended Lesk [Torres & Gelbukh, 2009]                       54.60%
ShotgunWSD 1.0 + Extended Lesk [Butnaru et al., 2017]        55.78%
ShotgunWSD 1.0 + Sense Embeddings [Butnaru et al., 2017]     57.55%
ShotgunWSD 2.0                                               58.24%

In Table 9.3, we present the $F_1$ scores of the two versions of ShotgunWSD against the MCS baseline, the MCS estimation approach [Bhingardive et al., 2015] and the extended Lesk measure [Torres & Gelbukh, 2009], on the Senseval-2 English all-words data set. The empirical
results presented in Table 9.3 indicate that ShotgunWSD 1.0 based on sense embeddings obtains an $F_1$ score that is about 5% better than the $F_1$ score of Bhingardive et al. [Bhingardive et al., 2015]. At the same time, ShotgunWSD 1.0 based on the extended Lesk method gives an $F_1$ score that is around 1% better than the $F_1$ score reported by Torres et al. [Torres & Gelbukh, 2009]. It is important to note that Torres et al. [Torres & Gelbukh, 2009] apply the extended Lesk measure by performing the brute-force search at the sentence level (not on the whole document), hence it is not surprising that ShotgunWSD 1.0 is able to obtain better results. Despite using windows of shorter length (6 instead of 8), ShotgunWSD 2.0 is able to yield better performance than both variants of ShotgunWSD 1.0. The improvement with respect to the better variant of ShotgunWSD 1.0, the one based on sense embeddings, is 0.69%. However, ShotgunWSD 2.0 (58.24%) is still under the MCS baseline (60.10%) on this data set.

9.4.5 Results on Senseval-3

Table 9.4: The $F_1$ scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the $F_1$ scores of an unsupervised WSD approach and the extended Lesk measure, on the Senseval-3 English all-words data set. The results reported for ShotgunWSD 1.0 are obtained for windows of $n = 8$ words and a majority vote on the top $k = 15$ configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of $n = 6$ words, a majority vote on the top $k = 15$ configurations and k-means clustering with 250 clusters.

Method                                                       F1 Score
Most Common Sense                                            62.30%
MCS Estimation [Bhingardive et al., 2015]                    43.28%
Extended Lesk [Torres & Gelbukh, 2009]                       49.60%
ShotgunWSD 1.0 + Extended Lesk [Butnaru et al., 2017]        57.89%
ShotgunWSD 1.0 + Sense Embeddings [Butnaru et al., 2017]     59.82%
ShotgunWSD 2.0                                               59.92%

We also compare ShotgunWSD 1.0 and 2.0 with the MCS baseline, the MCS estimation approach of Bhingardive et al. [Bhingardive et al., 2015] and the extended Lesk measure [Torres & Gelbukh, 2009] on the Senseval-3
English all-words data set. The corresponding $F_1$ scores are presented in Table 9.4. With an $F_1$ score of 57.89%, ShotgunWSD 1.0 based on the extended Lesk measure brings a remarkable improvement of 8% over the extended Lesk algorithm applied at the sentence level [Torres & Gelbukh, 2009]. Moreover, the empirical results indicate that all versions of ShotgunWSD reach considerably better $F_1$ scores compared to the MCS estimation approach [Bhingardive et al., 2015]. By using sense embeddings in a completely different way than Bhingardive et al. [Bhingardive et al., 2015], ShotgunWSD 1.0 attains an $F_1$ score of 59.82%, which is 16.54% above the MCS estimation approach [Bhingardive et al., 2015]. On this data set, the improvement of ShotgunWSD 2.0 over ShotgunWSD 1.0 is very small (from 59.82% to 59.92%).

9.4.6 Results on SemEval 2015

Table 9.5: The $F_1$ scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the $F_1$ scores of two state-of-the-art WSD approaches, on the SemEval 2015 English all-words task. The results reported for ShotgunWSD 1.0 are obtained for windows of $n = 8$ words and a majority vote on the top $k = 15$ configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of $n = 6$ words, a majority vote on the top $k = 15$ configurations and k-means clustering with 250 clusters.

Method                                                       F1 Score
BabelNet First Sense [Moro & Navigli, 2015]                  66.30%
LIMSI [Apidianaki & Gong, 2015]                              64.70%
Sudoku [Manion, 2015]                                        59.90%
ShotgunWSD 1.0 + Extended Lesk [Butnaru et al., 2017]        45.66%
ShotgunWSD 1.0 + Sense Embeddings [Butnaru et al., 2017]     58.44%
ShotgunWSD 2.0                                               61.30%

Table 9.5 shows the results of ShotgunWSD 1.0 and 2.0 against the top two methods [Apidianaki & Gong, 2015; Manion, 2015] from the SemEval 2015 English all-words WSD task [Moro & Navigli, 2015]. The table also includes the BabelNet first sense (BFS) [Moro & Navigli, 2015] as reference. We first note that ShotgunWSD 1.0 based on extended Lesk gives considerably worse results (45.66%) than ShotgunWSD 1.0 based on sense embeddings
(58.44%). At the same time, ShotgunWSD 2.0 attains better performance than both variants of ShotgunWSD 1.0. The $F_1$ score of ShotgunWSD 2.0 (61.30%) is also 1.4% better than the approach of Manion [Manion, 2015]. On the other hand, the performance of ShotgunWSD 2.0 is 3.4% under the performance of the winners [Apidianaki & Gong, 2015] of the SemEval 2015 English all-words WSD task. However, it is important to remark that Apidianaki et al. [Apidianaki & Gong, 2015] exploit the parallelism of the multilingual SemEval 2015 test data by using translations as a source of indirect supervision for sense selection. Like Manion [Manion, 2015] and the rest of the participants [Moro & Navigli, 2015], we do not use this kind of information in our algorithm.

9.5 Discussion

In this chapter, we have presented a new version (2.0) of a recently introduced global WSD algorithm [Butnaru et al., 2017] inspired by the Shotgun genome sequencing technique [Anderson, 1981]. Compared to other bio-inspired WSD methods [Schwab et al., 2012, 2013a], our algorithm has fewer parameters. Furthermore, these parameters can be intuitively tuned with respect to the WSD task. Considering the empirical results on all four data sets included in our evaluation, we can conclude that ShotgunWSD 2.0 attains generally better results (sometimes up to 2.3%) than ShotgunWSD 1.0 based on sense embeddings, which, in turn, is better than ShotgunWSD 1.0 based on the extended Lesk measure. On one of the data sets (SemEval 2007), all versions of ShotgunWSD yield better performance than the MCS baseline. We need to underline that the strong MCS baseline is not a viable approach for practical situations, since human input is required to indicate which sense of a word is the most frequent in a given text. Since a word's dominant sense will vary across domains and text genres, it is not a trivial task to develop an NLP approach that determines the most common sense, as can be observed from the performance gap between the MCS
baseline and the MCS estimation approach of Bhingardive et al. [Bhingardive et al., 2015]. Corpora used for the evaluation of WSD algorithms usually contain annotations regarding the most common sense, but the MCS baseline will not work outside the annotated data. Therefore, researchers in the WSD community [Agirre & Edmonds, 2006] consider it important to outperform the MCS baseline, even slightly, with an unsupervised method. In light of this comment, we consider remarkable the fact that ShotgunWSD surpasses the MCS baseline on SemEval 2007. Furthermore, our algorithm compares favorably to other state-of-the-art unsupervised WSD methods [Bhingardive et al., 2015; Chen et al., 2014; Manion, 2015; Schwab et al., 2013a] and to the extended Lesk measure [Banerjee & Pedersen, 2002; Torres & Gelbukh, 2009].

Regarding the performance of our algorithm, an interesting question that arises is how much the assembly phase helps. We carried out a small experiment to provide an answer to this question. We considered the ShotgunWSD 1.0 variant based on sense embeddings without changing its parameters, and we removed the assembly phase completely. Therefore, the algorithm no longer produced configurations of length greater than 8, as the parameter $n$ is set to 8. We have evaluated this stub algorithm on SemEval 2007 and we have obtained a lower $F_1$ score (77.61%). This result indicates that the assembly phase in Algorithm 8 boosts the performance by about 2%.

It is perhaps interesting to note that we have considered an approach to combine the two semantic relatedness approaches independently used by ShotgunWSD 1.0, namely the extended Lesk measure and sense embeddings, with the goal of improving the accuracy. However, we did not observe any improvements when fusing these two measures. For this reason, we did not report any results of the combination in the chapter. In future work, we aim to investigate if training sense embeddings instead of deriving them from pre-trained
word embeddings could yield better accuracy. Another promising direction is to compute the semantic relatedness of sense configurations based on the sum of sense tuples instead of sense pairs.

References

Agirre, Eneko and Edmonds, Philip Glenny. Word Sense Disambiguation: Algorithms and Applications. Springer, 2006.

Agirre, Eneko, Lopez de Lacalle, Oier, and Soroa, Aitor. Random Walks for Knowledge-based Word Sense Disambiguation. Computational Linguistics, 40(1):57–84, March 2014.

Anderson, Stephen. Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Research, 9(13):3015–3027, 1981.

Apidianaki, Marianna and Gong, Li. LIMSI: Translations as Source of Indirect Supervision for Multilingual All-Words Sense Disambiguation and Entity Linking. In Proceedings of SemEval, pp. 298–302, 2015.

Arthur, David and Vassilvitskii, Sergei. k-means++: The Advantages of Careful Seeding. In Proceedings of SODA, pp. 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

Banerjee, Satanjeev and Pedersen, Ted. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of CICLing, pp. 136–145, London, UK, 2002. Springer-Verlag.

Banerjee, Satanjeev and Pedersen, Ted. Extended Gloss Overlaps As a Measure of Semantic Relatedness. In Proceedings of IJCAI, pp. 805–810, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.

Bengio, Yoshua, Ducharme, Rejean, Vincent, Pascal, and Janvin, Christian. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155, March 2003.

Bennett, Simon. Solexa Ltd. Pharmacogenomics, 5(4):433–438, June 2004.

Bhingardive, Sudha, Singh, Dhirendra, V, Rudramurthy, Redkar, Hanumant Harichandra, and Bhattacharyya, Pushpak. Unsupervised
Most Frequent Sense Detection using Word Embeddings. In Proceedings of NAACL, pp. 1238-1243. The Association for Computational Linguistics, 2015. (cited on 252, 255, 256, 264, 267, 268, 272, 273, 275)
Butnaru, Andrei, Ionescu, Radu Tudor, and Hristea, Florentina. ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing. In Proceedings of EACL, pp. 916-926, 2017. (cited on 252, 254, 255, 260, 261, 262, 268, 269, 270, 272, 273, 274, 275)
Carpuat, Marine and Wu, Dekai. Improving statistical machine translation using word sense disambiguation. In Proceedings of EMNLP, pp. 61-72, 2007. (cited on 252)
Chen, Xinxiong, Liu, Zhiyuan, and Sun, Maosong. A Unified Model for Word Sense Representation and Disambiguation. In Proceedings of EMNLP, pp. 1025-1035, Doha, Qatar, October 2014. Association for Computational Linguistics. (cited on 252, 255, 256, 267, 270, 271, 275)
Chifu, Adrian-Gabriel and Ionescu, Radu Tudor. Word sense disambiguation to improve precision for ambiguous queries. Central European Journal of Computer Science, 2(4):398-411, 2012. (cited on 252)
Collobert, Ronan and Weston, Jason. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of ICML, pp. 160-167, New York, NY, USA, 2008. ACM. (cited on 254, 256, 263)
Edmonds, Philip and Cotton, Scott. SENSEVAL-2: Overview. In Proceedings of SENSEVAL, pp. 1-5, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics. (cited on 255, 268)
Fellbaum, Christiane (ed.). WordNet: An Electronic Lexical Database. MIT Press, 1998. (cited on 253, 255, 257, 261)
Hristea, Florentina, Popescu, Marius, and Dumitrescu, Monica. Performing word sense disambiguation at the border between unsupervised and knowledge-based techniques. Artificial Intelligence Review, 30(1-4):67-86, 2008. (cited on 252, 262)
Iacobacci, Ignacio, Pilehvar, Mohammad Taher, and Navigli, Roberto. Embeddings for Word Sense Disambiguation: An Evaluation
Study. In Proceedings of ACL, pp. 897-907, August 2016. (cited on 252, 255, 256, 260)
Ionescu, Radu Tudor, Smeureanu, Sorina, Popescu, Marius, and Alexe, Bogdan. Detecting abnormal events in video using Narrowed Motion Clusters. CoRR, abs/1801.05030, 2018. URL http://arxiv.org/abs/1801.05030. (cited on 266)
Istrail, Sorin, Sutton, Granger G., Florea, Liliana, Halpern, Aaron L., Mobarry, Clark M., Lippert, Ross, Walenz, Brian, Shatkay, Hagit, Dew, Ian, Miller, Jason R., Flanigan, Michael J., Edwards, Nathan J., Bolanos, Randall, Fasulo, Daniel, Halldorsson, Bjarni V., Hannenhalli, Sridhar, Turner, Russell, Yooseph, Shibu, Lu, Fu, Nusskern, Deborah R., Shue, Bixiong Chris, Zheng, Xiangqun Holly, Zhong, Fei, Delcher, Arthur L., Huson, Daniel H., Kravitz, Saul A., Mouchard, Laurent, Reinert, Knut, Remington, Karin A., Clark, Andrew G., Waterman, Michael S., Eichler, Evan E., Adams, Mark D., Hunkapiller, Michael W., Myers, Eugene W., and Venter, J. Craig. Whole Genome Shotgun Assembly and Comparison of Human Genome Assemblies. Proceedings of the National Academy of Sciences, 101(7):1916-1921, 2004. (cited on 253, 257)
Lesk, Michael. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proceedings of SIGDOC, pp. 24-26, New York, NY, USA, 1986. ACM. (cited on 253, 255, 262)
Levenshtein, V. I. Binary codes capable of correcting deletions, insertions and reversals. Cybernetics and Control Theory, 10(8):707-710, 1966. (cited on 257)
Manion, Steve L. SUDOKU: Treating Word Sense Disambiguation & Entity Linking as a Deterministic Problem - via an Unsupervised & Iterative Approach. In Proceedings of SemEval, pp. 365-369, 2015. (cited on 255, 267, 268, 274, 275)
Mihalcea, Rada, Chklovski, Timothy, and Kilgarriff, Adam. The Senseval-3 English Lexical Sample Task. In Proceedings of SENSEVAL-3, pp. 25-28, Stroudsburg, PA, USA, July 2004. Association for Computational Linguistics. (cited on 255, 268)
Mikolov,
Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Gregory S., and Dean, Jeffrey. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, pp. 3111-3119, 2013. (cited on 254, 256, 260, 263, 264)
Miller, George A. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39-41, November 1995. (cited on 253, 255, 259, 261)
Moro, Andrea and Navigli, Roberto. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of SemEval, pp. 288-297, 2015. (cited on 255, 268, 274)
Navigli, Roberto. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10:1-10:69, February 2009. (cited on 252, 255)
Navigli, Roberto, Litkowski, Kenneth C., and Hargraves, Orin. SemEval-2007 Task 07: Coarse-grained English All-words Task. In Proceedings of SemEval, pp. 30-35, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics. (cited on 255, 267)
Nguyen, Kiem-Hieu and Ock, Cheol-Young. Word sense disambiguation as a traveling salesman problem. Artificial Intelligence Review, 40(4):405-427, 2013. (cited on 255)
Panchenko, Alexander, Ruppert, Eugen, Faralli, Stefano, Ponzetto, Simone Paolo, and Biemann, Chris. Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation. In Proceedings of EACL, pp. 86-98, Valencia, Spain, 2017. Association for Computational Linguistics. (cited on 255, 256)
Patel, Ravi K. and Jain, Mukesh. NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data. PLoS ONE, 7(2):1-7, February 2012. (cited on 257)
Patwardhan, Siddharth, Banerjee, Satanjeev, and Pedersen, Ted. Using Measures of Semantic Relatedness for Word Sense Disambiguation. In Proceedings of CICLing, pp. 241-257, Berlin, Heidelberg, 2003. Springer-Verlag. (cited on 253, 255, 256, 260)
Plaza, Laura, Jimeno-Yepes, Antonio Jose, Diaz, Alberto, and Aronson, Alan R. Studying the correlation between different word sense disambiguation methods and
summarization effectiveness in biomedical texts. BMC Bioinformatics, 12:355-367, 2011. (cited on 252)
Porter, Martin F. An algorithm for suffix stripping. Program, 14(3):130-137, 1980. (cited on 262)
Schwab, Didier, Goulian, Jérôme, Tchechmedjiev, Andon, and Blanchon, Hervé. Ant Colony Algorithm for the Unsupervised Word Sense Disambiguation of Texts: Comparison and Evaluation. In Proceedings of COLING, pp. 2389-2404, Mumbai, India, December 2012. (cited on 252, 253, 254, 255, 256, 271, 275)
Schwab, Didier, Goulian, Jérôme, and Tchechmedjiev, Andon. Worst-case Complexity and Empirical Evaluation of Artificial Intelligence Methods for Unsupervised Word Sense Disambiguation. International Journal of Engineering and Technology, 8(2):124-153, August 2013a. (cited on 252, 253, 254, 255, 256, 267, 269, 270, 271, 275)
Schwab, Didier, Tchechmedjiev, Andon, and Goulian, Jérôme. GETALP: Propagation of a Lesk Measure through an Ant Colony Algorithm. In Proceedings of SemEval, volume 1, pp. 232-240, June 2013b. (cited on 252, 255)
Sumanth, Chiraag and Inkpen, Diana. How much does word sense disambiguation help in sentiment analysis of micropost data?
In Proceedings of WASSA, pp. 115-121, September 2015. (cited on 252)
Torres, Sulema and Gelbukh, Alexander. Comparing Similarity Measures for Original WSD Lesk Algorithm. Research in Computing Science, 43:155-166, 2009. (cited on 267, 268, 272, 273, 275)
Vidhu Bhala, R. V. and Abirami, S. Trends in word sense disambiguation. Artificial Intelligence Review, 42(2):159-171, 2014. (cited on 255)
Voelkerding, Karl V., Dames, Shale A., and Durtschi, Jacob D. Next Generation Sequencing: From Basic Research to Diagnostics. Clinical Chemistry, 55(4):41-47, 2009. (cited on 257)

Chapter 10

Conclusions and Future Work

Abstract

This chapter presents the general conclusions of this thesis. The conclusions point to the fact that the concept of treating image and text in a similar way is indeed fertile. We also provide some general guidelines on future work and discuss new directions that could arise by transferring knowledge between computer vision, text mining and computational biology.

10.1 Discussion and Conclusions

Machine learning is currently a vast area of research with applications in a variety of fields, such as computer vision [Forsyth & Ponce, 2002; Krizhevsky et al., 2012; Szeliski, 2010; Zhang et al., 2007], computational biology [Dinu & Ionescu, 2013; Inza et al., 2010; Leslie et al., 2002], information retrieval [Chifu & Ionescu, 2012; Ionescu et al., 2015b; Manning et al., 2008], natural language processing [Cozma et al., 2018; Lodhi et al., 2002; Sebastiani, 2002], data mining [Han et al., 2011], and many others [Ionescu et al., 2015a]. In this thesis, we have presented several machine learning methods that are designed for specific tasks that belong to computer vision, computational biology or text mining. The studied tasks range from object recognition, gesture recognition and abnormal event detection in video, to sequence alignment, text categorization by topic, automatic essay scoring, polarity classification and word sense disambiguation. For this broad
range of applications, we have employed several similarity-based learning or deep learning methods presented in Chapter 2. More specifically, we have studied approaches such as nearest neighbor models, kernel methods, clustering methods and convolutional neural networks. The studied methods exhibit state-of-the-art performance levels in the approached tasks.

The applications approached in this thesis can be divided into two areas that are traditionally considered as different research fields, namely computer vision on the one hand and string processing on the other. While computer vision deals with image data, string processing refers to the analysis of string data in the form of text documents, DNA strings, and so on. Although at first sight computer vision and string processing seem to be unrelated fields of study, recent results, such as the ones presented in this thesis, suggest that image and string analysis can be approached in similar ways. Indeed, the concept of treating image and text in a similar fashion has proven to be very fertile for particular applications in computer vision [Duygulu et al., 2002; Farhadi et al., 2010; Leung & Malik, 2001; Sadeghi & Farhadi, 2011; Sivic et al., 2005] and text mining [Barnard & Johnson, 2005; Barnard et al., 2003; Johnson & Zhang, 2015; Pu et al., 2007].

The concept of treating image and text in a similar manner represents the cornerstone of this thesis. Hence, several methods that are based on this underlying concept have been thoroughly presented in individual chapters. First, we presented an improvement to the popular bag-of-visual-words model in Chapter 3. This model is inspired by the bag-of-words model from text mining and information retrieval. The improvement consists of encoding spatial information in an efficient manner through the Spatial Non-Alignment Kernel (SNAK). Second, we described an unsupervised as well as a supervised method for abnormal event detection in Chapter 5. The unsupervised method is based on
unmasking, a technique that was previously used for authorship verification of text documents. The supervised method is a two-stage outlier detection method that eliminates smaller k-means clusters in the first stage. Third, a new distance measure for gesture recognition in video, namely the Local Frame Match Distance, has been presented in Chapter 4. It is inspired by a distance measure for strings, namely the Local Rank Distance, presented in Chapter 6. Designed to conform to more general principles and adapted to DNA strings, Local Rank Distance has demonstrated that it can achieve better results than several state-of-the-art methods for DNA sequence alignment. Fourth, two approaches for including spatial information in the bag-of-visual-words model, namely the spatial pyramid and the Spatial Non-Alignment Kernel, have been applied on text data in Chapter 7, improving the results of the bag-of-words model in the context of text categorization by topic. Fifth, we transferred the bag-of-visual-words model from computer vision to text mining, by replacing the local image descriptors with word embeddings, as detailed in Chapter 8. We termed the resulting representation the bag-of-super-word-embeddings (BOSWE). Moreover, we applied the intersection kernel, which is widely used in computer vision, on top of BOSWE and we obtained state-of-the-art results in automatic essay scoring by combining BOSWE with string kernels. Lastly, we adapted Shotgun genome sequencing to word sense disambiguation. To eliminate outlier word senses, we applied the same outlier detection method as in the first stage of the supervised abnormal event detection method presented in Chapter 5.

To summarize, all the approaches presented in this thesis support the concept of treating image and text in a similar manner. However, it must be pointed out that, in general, most approaches have to be redesigned or adapted in order to work on different data types. Usually, the amount of work required
to transfer a specific concept or method from one domain to the other depends on the place of the respective concept or method in the processing pipeline. The closer the concept is to the raw data type, the harder it is to adapt it for a new data type. This is the case of the distance or similarity measures that work directly on image or text data, such as Local Frame Match Distance or Local Rank Distance. The concepts that sit somewhere in the middle of the processing pipeline can be adapted more easily. This is the case of the SNAK framework or the BOSWE framework. Finally, it becomes almost trivial to transfer the concepts that are applied at the end of the processing pipeline. For instance, classification methods such as Support Vector Machines or Kernel Ridge Regression can be directly applied on various classification tasks from computer vision or text mining, without requiring any other changes than parameter tuning. It must be pointed out that, in this thesis, the trivial cases have not been considered as proper examples of knowledge transfer, although such efforts are always appreciated in the literature [Joachims, 1998].

Although a significant amount of research has been conducted using the idea of borrowing and adapting concepts from text processing to computer vision, or from computer vision to text processing, the concept of treating image and text (or more generally, strings) in a similar fashion is far from saturated. The methods presented in this thesis barely scratch the surface of this topic. Surely, there are still many concepts and methods studied and applied in a single field, waiting to be discovered and used by researchers in other fields of study.

Nonetheless, it should be pointed out that it does not always make sense to consider knowledge transfer. Indeed, there are many approaches and tasks that are very specific to one domain or the other. One such example is the objectness measure [Alexe et al., 2010, 2012], which quantifies how likely it is for an image
window to contain an object. In computer vision, the task of identifying image regions that contain objects is not trivial and it requires elaborate methods such as the objectness measure. Trying to develop an elaborate method to determine if a text window (the equivalent of an image window) contains a meaningful concept is really not necessary. This can be achieved simply by eliminating the stop words in the text window.

To conclude, this thesis represents a strong argument in favor of treating image and text in a similar fashion, a concept that is very promising and truly fertile for some specific applications in computer vision and text mining.

10.2 Future Work

In future work, we aim to study curriculum learning strategies in order to train better models for image and text classification tasks. A part of this future work has already received funding through the project PN-III-P1-1.1-PD-2016-0787 entitled "Object recognition in images using curriculum learning". We next describe the issues of the state-of-the-art deep learning models that we aim to solve through our project. We also present the project's objectives.

Figure 10.1: Images with difficulty scores predicted by our system in increasing order of their difficulty.

Convolutional neural networks [He et al., 2016; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015] have become the state-of-the-art approach for object recognition in images, and they are increasingly used for various computer vision tasks, for example abnormal event detection [Ionescu et al., 2017; Smeureanu et al., 2017]. Researchers have focused on building deeper and deeper architectures, this being the main driver for the recent performance improvements. For instance, the CNN model of Krizhevsky et al. [Krizhevsky et al., 2012] reaches a top-5 error of 15.4% with only 8 layers, while the ResNet model [He et al., 2016] reaches a top-5 error of 3.6% with 152 layers. While the CNN architecture has evolved over the last few years
to accommodate more convolutional layers, reduce the size of the filters, and even eliminate the fully-connected layers, little attention has been paid to improving the training process, with few exceptions in the literature [Chen & Gupta, 2015; Lee & Grauman, 2011]. An important limitation of the state-of-the-art CNN models is that examples are considered in a random order during training. Since the CNN architecture is inspired by the human visual cortex, it seems reasonable to consider that the learning process should also be inspired by how humans learn. One essential difference from how machines are trained is that humans learn the basic (easy) concepts sooner and the advanced (hard) concepts later. This is essentially reflected in all the curricula taught in schooling systems around the world, as humans learn much better when the examples are not randomly presented but organized in a meaningful order. Using a similar strategy for training a machine learning model, we can achieve two important benefits: (i) an increase in the convergence speed of the training process and (ii) better accuracy. Curriculum learning [Bengio et al., 2009] formalizes the easy-to-hard training strategies in the context of machine learning. The work of Bengio et al. [Bengio et al., 2009] showed that a curriculum learning strategy brings the expected benefits for artificially generated data. In order to use a curriculum learning strategy to train better CNN models for object recognition from natural images, we need a way to determine which images are easy and which are difficult from the perspective of the object recognition task. Since this is not a trivial task, Bengio et al. [Bengio et al., 2009] limited their experiments to artificially generated data. However, our recent work [Ionescu et al., 2016] on estimating image difficulty, defined as the human response time for solving a visual search task, enables us to explore the curriculum learning paradigm for training CNN models on natural images. By
all means, we can use our image difficulty predictor to arrange the images in increasing order of difficulty, as shown in Figure 10.1. Then, we can adopt a curriculum learning strategy by starting the training process using only the easiest samples and by gradually introducing more difficult samples.

Our main objective is to improve a state-of-the-art CNN model [He et al., 2016; Szegedy et al., 2015] by adopting a curriculum learning strategy. Different curriculum learning strategies might have different outcomes. Therefore, we aim to investigate three curriculum learning strategies in order to find the best strategy and to maximize our improvements in terms of accuracy and training time. These strategies are detailed next.

Perhaps the most straightforward and simple approach is to train the CNN model by gradually adding more difficult examples. Instead of considering all the training images from the beginning, we will divide the training set of images into k batches, such that the images with the lowest difficulty scores are placed in the first batch, those with the next lowest difficulty scores are placed in the second batch, and so on, until the last batch contains the images with the highest difficulty scores. The CNN model will be trained on the first batch of images for a number of t epochs. Then, the second batch will be included in the training set, and the CNN model will be trained for another t epochs. The process goes on until we feed all the batches to the CNN model during training.

Our second strategy is to train specialized CNN models for each level of difficulty. This time, we will divide the training set of images into three batches, such that the images with the lowest difficulty scores are placed in the first (easy) batch, and the images with the highest difficulty scores are placed in the third (difficult) batch, leaving the images with medium difficulty scores in the second batch. For the easy batch, we will train a less deep architecture,
e.g. 8-layer VGG-f, for the medium batch a deeper architecture, e.g. 50-layer ResNet, and for the difficult batch an even deeper architecture, e.g. 152-layer ResNet. Thus, we will obtain three different CNN models, one for each level of difficulty. Intuitively, a deeper architecture should be able to better represent the more complex patterns that usually occur in difficult images: multiple objects of various classes scattered across the entire image [Ionescu et al., 2016]. In the test phase, we will use our image difficulty predictor to assess the difficulty level of the test image and pass it to the corresponding CNN model.

Our third strategy is based on training the neural network by feeding all the examples from the beginning using random mini-batches, as usual. However, before the training process starts, we will assign a weight to each training image according to its difficulty score: the easier images will have a higher weight and the harder images will have a lower weight. This can be achieved by modifying the loss function that we want to optimize by training the CNN model. During the training process, we will gradually increase the weights of the harder images until they all become equal near the end. The idea is that we want our model to consider the easier examples more important at first, i.e. we can afford a greater loss for the harder examples. However, while the network learns the easier examples, we want it to pay more and more attention to the harder examples too.

Studying curriculum learning strategies to train CNN models better than the current state-of-the-art [He et al., 2016; Szegedy et al., 2015] will have a large-scale impact in the computer vision community, since researchers have adopted state-of-the-art CNN models for more and more computer vision applications by fine-tuning the models to the specific tasks. Indeed, CNN models are now used in many practical applications including object recognition systems built into autonomous vehicles, content-based image retrieval,
face recognition, gesture recognition, remote sensing, and so on. The success of adopting curriculum learning strategies for training neural networks for object recognition will also have an impact in other research areas. For instance, researchers might consider using curriculum learning for training convolutional neural networks used in speech recognition or natural language processing. In fact, we aim to continue our work by investigating curriculum learning strategies for training better deep text classification models.

References

Alexe, Bogdan, Deselaers, Thomas, and Ferrari, Vittorio. What is an object? In Proceedings of CVPR, pp. 73-80, San Francisco, CA, USA, June 2010. IEEE. (cited on 285)
Alexe, Bogdan, Deselaers, Thomas, and Ferrari, Vittorio. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189-2202, 2012. (cited on 285)
Barnard, Kobus and Johnson, Matthew. Word sense disambiguation with pictures. Artificial Intelligence, 167(1-2):13-30, September 2005. (cited on 283)
Barnard, Kobus, Duygulu, Pinar, Forsyth, David, de Freitas, Nando, Blei, David M., and Jordan, Michael I. Matching words and pictures. Journal of Machine Learning Research, 3:1107-1135, March 2003. (cited on 283)
Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In Proceedings of ICML, pp. 41-48, New York, NY, USA, 2009. ACM. (cited on 287)
Chen, Xinlei and Gupta, Abhinav. Webly supervised learning of convolutional networks. In Proceedings of ICCV, pp. 1431-1439, 2015. (cited on 286)
Chifu, Adrian-Gabriel and Ionescu, Radu Tudor. Word sense disambiguation to improve precision for ambiguous queries. Central European Journal of Computer Science, 2(4):398-411, 2012. (cited on 282)
Cozma, Madalina, Butnaru, Andrei, and Ionescu, Radu Tudor. Automated essay scoring with string kernels and word embeddings. In Proceedings of ACL, pp. 503-509, 2018. (cited on 282)
Dinu, Liviu P. and
Ionescu, Radu Tudor. Clustering based on Median and Closest String via Rank Distance with Applications on DNA. Neural Computing and Applications, 24(1):77-84, 2013. (cited on 282)
Duygulu, P., Barnard, Kobus, Freitas, J. F. G. de, and Forsyth, David A. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In Proceedings of ECCV, pp. 97-112, London, UK, 2002. Springer-Verlag. (cited on 283)
Farhadi, Ali, Hejrati, Mohsen, Sadeghi, Mohammad Amin, Young, Peter, Rashtchian, Cyrus, Hockenmaier, Julia, and Forsyth, David. Every picture tells a story: generating sentences from images. In Proceedings of ECCV, pp. 15-29, Berlin, Heidelberg, 2010. Springer-Verlag. (cited on 283)
Forsyth, David A. and Ponce, Jean. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002. (cited on 282)
Han, Jiawei, Kamber, Micheline, and Pei, Jian. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011. (cited on 282)
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep Residual Learning for Image Recognition. In Proceedings of CVPR, pp. 770-778, June 2016. (cited on 285, 286, 287, 288)
Inza, Iñaki, Calvo, Borja, Armañanzas, Rubén, Bengoetxea, Endika, Larrañaga, Pedro, and Lozano, José A. Machine learning: an indispensable tool in bioinformatics. Methods in Molecular Biology (Clifton, N.J.), 593:25-48, 2010. (cited on 282)
Ionescu, Radu Tudor, Popescu, Andreea Lavinia, Popescu, Marius, and Popescu, Dan. BiomassID: A Biomass Type Identification System for Mobile Devices. Computers and Electronics in Agriculture, 113:244-253, 2015a. (cited on 282)
Ionescu, Radu Tudor, Alexe, Bogdan, Leordeanu, Marius, Popescu, Marius, Papadopoulos, Dim, and Ferrari, Vittorio. How hard can it be?
Estimating the difficulty of visual search in an image. In Proceedings of CVPR, pp. 2157-2166, June 2016. (cited on 287, 288)
Ionescu, Radu Tudor, Smeureanu, Sorina, Alexe, Bogdan, and Popescu, Marius. Unmasking the abnormal events in video. In Proceedings of ICCV, pp. 2895-2903, 2017. (cited on 286)
Ionescu, Radu Tudor, Chifu, Adrian-Gabriel, and Mothe, Josiane. DeShaTo: Describing the Shape of Cumulative Topic Distributions to Rank Retrieval Systems Without Relevance Judgments. In Proceedings of SPIRE, volume 9309, pp. 75-82. Springer LNCS, 2015b. (cited on 282)
Joachims, Thorsten. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of ECML, pp. 137-142, London, UK, 1998. Springer-Verlag. (cited on 285)
Johnson, Rie and Zhang, Tong. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In Proceedings of NAACL, pp. 103-112, 2015. (cited on 283)
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS, pp. 1106-1114, 2012. (cited on 282, 285, 286)
Lee, Yong Jae and Grauman, Kristen. Learning the easy things first: Self-paced visual category discovery. In Proceedings of CVPR, pp. 1721-1728, 2011. (cited on 286)
Leslie, Christina S., Eskin, Eleazar, and Noble, William Stafford. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Proceedings of Pacific Symposium on Biocomputing, pp. 566-575, 2002. (cited on 282)
Leung, Thomas and Malik, Jitendra. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. International Journal of Computer Vision, 43(1):29-44, June 2001. (cited on 283)
Lodhi, Huma, Saunders, Craig, Shawe-Taylor, John, Cristianini, Nello, and Watkins, Christopher J. C. H. Text Classification using String Kernels. Journal of Machine Learning Research, 2:419-444, 2002. (cited on 282)
Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich.
Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. (cited on 282)
Pu, Wen, Liu, Ning, Yan, Shuicheng, Yan, Jun, Xie, Kunqing, and Chen, Zheng. Local Word Bag Model for Text Categorization. In Proceedings of ICDM, pp. 625-630, Los Alamitos, CA, USA, 2007. IEEE Computer Society. (cited on 283)
Sadeghi, M. A. and Farhadi, A. Recognition using visual phrases. In Proceedings of CVPR, pp. 1745-1752, Washington, DC, USA, 2011. IEEE Computer Society. (cited on 283)
Sebastiani, Fabrizio. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, March 2002. (cited on 282)
Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of ICLR, 2014. (cited on 285)
Sivic, Josef, Russell, Bryan C., Efros, Alexei A., Zisserman, Andrew, and Freeman, William T. Discovering Objects and their Localization in Images. In Proceedings of ICCV, pp. 370-377. IEEE Computer Society, 2005. (cited on 283)
Smeureanu, Sorina, Ionescu, Radu Tudor, Popescu, Marius, and Alexe, Bogdan. Deep Appearance Features for Abnormal Behavior Detection in Video. In Proceedings of ICIAP, volume 10485, pp. 779-789, 2017. (cited on 286)
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going Deeper With Convolutions. In Proceedings of CVPR, pp. 1-9, June 2015. (cited on 285, 287, 288)
Szeliski, Richard. Computer Vision: Algorithms and Applications. Springer-Verlag New York, Inc., New York, NY, USA, 1st edition, 2010. (cited on 282)
Zhang, Jian, Marszalek, Marcin, Lazebnik, Svetlana, and Schmid, Cordelia. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision, 73(2):213-238, June 2007. (cited on 282)

List of Figures

1.1 An example in which the context helps to disambiguate an object (kitchen glove), which can easily
be mistaken for something else if the rest of the image is not seen. The image belongs to the Pascal VOC 2007 data set.
1.2 An example of repetitive local image patterns that form the building blocks of the bag-of-visual-words model.
1.3 An object that can be described by multiple categories such as toy, bear, or both.
2.1 A 3-NN model for handwritten digit recognition. For visual interpretation, digits are represented in a two-dimensional feature space. The figure shows 30 digits sampled from the popular MNIST data set. When the new digit x needs to be recognized, the 3-NN model selects the nearest 3 neighbors and assigns label 4 based on a majority vote.
2.2 A 1-NN model for handwritten digit recognition. The figure shows 30 digits sampled from the popular MNIST data set. The decision boundary of the 1-NN model generates a Voronoi partition of the digits.
2.3 The function embeds the data into a feature space where the nonlinear relations now appear linear. Machine learning methods can easily detect such linear relations.
2.4 A convolutional layer with two filters of size 5 × 5 × 3 that are applied on an input tensor of size 64 × 64 × 3, using a stride of 1 and a padding of 0. The resulting output tensor is of size 60 × 60 × 2.
2.5 A layer based on Rectified Linear Units (ReLU). The negative values in the input tensor are replaced with 0, resulting in an output tensor of the same size as the input.
2.6 A max-pooling layer with a filter support of 2 × 2 and a stride of 2. At every location, the filter keeps the maximum value. The resulting output tensor is twice as small in height and width, but its depth is the same as the depth of the input tensor.
3.1 The BOVW learning model for object class recognition. The feature vector consists of SIFT features computed on a regular grid across the image (dense SIFT) and vector quantized into visual words. The frequency of each visual word is then recorded in a histogram. The histograms enter the training stage. Learning
is done by a kernel method 78 3 2 The spatial similarity of two images computed with the SNAK framework First, the center of mass is computed according to the objectness map The average position and the standard deviation of the spatial distribution of each visual word are computed next The images are aligned according to their centers, and the SNAK kernel is computed by summing the distances between the average positions and the standard deviations of each visual word in the two images 83 3 3 A random sample of 12 images from the Pascal VOC data set Some of the images contain objects of more than one class For example, the image at the top left shows a dog sitting on a couch, and the image at the top right shows a person and a horse Dog, couch, person and horse are among the 20 classes of this data set 86 3 4 A random sample of 12 images from the Birds data set There are two images per class Images from the same class sit next to each other in this ﬁgure 87 295 LIST OF FIGURES 4 1 Matching hand trajectories using Dynamic Time Warping A cost matrix is computed while aligning the frames Dynamic program- ming is employed to ﬁnd the optimal alignment 105 4 2 Matching hand trajectories using Local Frame Match Distance Each hand location in the trajectory M is independently matched to the nearest hand location in the trajectory Q (there is no global alignment) 106 5 1 Our supervised anomaly detection framework based on Narrowed Motion Clusters In the training phase, we apply a two-stage out- lier detection algorithm based on k-means and one-class SVM In the testing phase, we label a test sample as abnormal if its max- imum normality score among the scores provided by the trained one-class SVM models is negative 117 5 2 Our unsupervised anomaly detection framework based on unmask- ing [Koppel et al , 2007] The steps are processed in sequential order from (A) to (H) 119 5 3 A set of 400 data points sampled from two normal distributions of diﬀerent means The points are 
clustered into 30 clusters using k-means The centroids of clusters with less than 10 samples are represented with a large blue square 131 5 4 A histogram representing the number of data points in each cluster The histogram corresponds to the k-means clustering applied over the 400 data points illustrated in Figure 9 2 A threshold of 10 is used to detect clusters of outliers 132 5 5 Frame-level anomaly detection scores (between 0 and 1) provided by our unmasking framework based on the late fusion strategy, for test video 4 in the Avenue data set The video has 947 frames Ground-truth abnormal events are represented in cyan, and our scores are illustrated in red 139 5 6 True positive (top row) versus false positive (bottom row) detec- tions of our unmasking framework based on the late fusion strategy Examples are selected from the Avenue data set 139 296 LIST OF FIGURES 5 7 Frame-level anomaly detection scores (between 0 and 1) provided by our approach based on combining NMC and CNN, for test video 4 in the Avenue data set The video has 947 frames Ground- truth abnormal events are represented in pink, and our scores are illustrated in blue Best viewed in color 141 5 8 True positive (top row) versus false positive (bottom row) detec- tions of our supervised framework based on NMC and CNN Ex- amples are selected from the Avenue data set Best viewed in color 141 5 9 True positive (top row) versus false positive (bottom row) detec- tions of our unmasking framework based on the late fusion strategy Examples are selected from the Subway Entrance gate 143 5 10 True positive (top row) versus false positive (bottom row) detec- tions of our supervised framework based on NMC and CNN Ex- amples are selected from the Subway Entrance gate 143 5 11 True positive (top row) versus false positive (bottom row) detec- tions of our unmasking framework based on the late fusion strategy Examples are selected from the UCSD data set 146 5 12 True positive (top row) versus false positive 
(bottom row) detec- tions of our supervised framework based on NMC and CNN Ex- amples are selected from the UCSD Ped1 data set 146 5 13 Frame-level anomaly detection scores (between 0 and 1) provided by our unmasking framework based on the late fusion strategy, for the ﬁrst scene in the UMN data set The video has 1453 frames Ground-truth abnormal events are represented in cyan, and our scores are illustrated in red 147 5 14 True positive (top row) versus false positive (bottom row) detec- tions of our unmasking framework based on the late fusion strategy Examples are selected from the UMN data set 149 5 15 Frame-level anomaly detection scores (between 0 and 1) provided by our framework based on the late fusion strategy, for the third scene in the UMN data set The video has 1744 test frames Ground-truth abnormal events are represented in pink, and our scores are illustrated in blue 149 297 LIST OF FIGURES 5 16 True positive (top row) versus false positive (bottom row) detec- tions of our framework based on NMC and CNN Examples are selected from the UMN data set 150 6 1 The precision-recall curves of the state-of-the-art aligners versus the precision-recall curves of the two LRD aligners, when 10; 000 contaminated reads of length 100 from the orangutan are included The two variants of the BOWTIE aligner are based on local and global alignment, respectively The LRD aligner based on hash tables is a fast approximate version of the original LRD aligner 178 6 2 The precision-recall curves of the state-of-the-art aligners versus the precision-recall curves of the two LRD aligners, when 50; 000 contaminated reads of length 100 from 5 mammals are included The two variants of the BOWTIE aligner are based on local and global alignment, respectively The LRD aligner based on hash tables is a fast approximate version of the original LRD aligner 181 6 3 Local Rank Distance computed in the presence of diﬀerent types of DNA changes such as point mutations, indels and inversions In 
the ﬁrst three cases (a), (b) and (c), a single type of DNA polymorphism is included in the second (bottom) string The last case (d) shows how LRD measures the diﬀerences between the two DNA strings when all the types of DNA changes occur in the second string The nucleotides aﬀected by changes are marked with bold To compare the results for the diﬀerent types of DNA changes, the ﬁrst string is always the same in all the four cases Note that in all the four examples, LRD is based on 1-mers In each case, LR= l+ r 195 Def tight 8 1 The BOSWE model for text classiﬁcation Words are embedded into a vector space and quantized into super word vectors The fre- quency of each super word vector is then recorded in a histogram The histograms enter the training stage Learning is done by a kernel method 228 298 LIST OF FIGURES 9 1 An example of building a global sense conﬁguration with Shotgun- WSD for a document of 7 words The algorithm is based on three main phases: building local sense conﬁgurations using a brute-force approach, assembling shorter conﬁgurations into longer conﬁgura- tions by preﬁx-suﬃx matching and majority voting 259 9 2 A set of 400 data points sampled from two normal distributions of diﬀerent means The points are clustered into 30 clusters using k-means The centroids of clusters with less than 10 samples are represented with a large blue square 265 9 3 A histogram representing the number of data points in each cluster The histogram corresponds to the k-means clustering applied over the 400 data points illustrated in Figure 9 2 A threshold of 10 is used to detect clusters of outliers 266 9 4 The F1scores of ShotgunWSD 2 0 on the ﬁrst document of Se- mEval 2007, using diﬀerent numbers of clusters for k-means 269 9 5 The F1scores of ShotgunWSD 2 0 on the ﬁrst document of Se- mEval 2007, using diﬀerent thresholds for eliminating the smaller k-means clusters 270 10 1 Images with diﬃculty scores predicted by our system in increasing order of their diﬃculty 286 
List of Tables
3.1 Mean AP on the Pascal VOC 2007 data set for different representations that encode spatial information into the BOVW model. For each representation, results are reported using several kernels and vocabulary dimensions. The best AP for each vocabulary dimension and each kernel is highlighted in bold.
3.2 Classification accuracy on the Birds data set for different representations that encode spatial information into the BOVW model. For each representation, results are reported using several kernels and vocabulary dimensions. The best accuracy for each vocabulary dimension and each kernel is highlighted in bold.
4.1 The accuracy rates of DTW and 1-NN versus two classifiers (1-NN and KDA) based on LFMD. The results of LFMD are obtained using p-frames of lengths 1, 2 and 3. The methods are compared using a 3-fold cross-validation procedure. The best result for each n is highlighted in bold.
5.1 Abnormal event detection results (in %) in terms of frame-level and pixel-level AUC on the Avenue data set. Our unsupervised framework based on unmasking and our supervised framework based on Narrowed Motion Clusters (as well as preliminary versions of the supervised framework) are compared with several state-of-the-art approaches [Del Giorno et al., 2016; Hasan et al., 2016; Lu et al., 2013; Smeureanu et al., 2017], which are listed in temporal order.
5.2 Abnormal event detection results (in %) in terms of frame-level and pixel-level AUC on the Avenue17 data set. Our supervised framework is compared with [Hinami et al., 2017].
5.3 Abnormal event detection results (in %) in terms of frame-level AUC on the Subway data set. Our unsupervised framework based on unmasking and our supervised framework based on Narrowed Motion Clusters are compared with several state-of-the-art approaches [Cheng et al., 2015; Cong et al., 2011; Del Giorno et al., 2016; Hasan et al., 2016; Saligrama & Chen, 2012], which are listed in temporal order.
5.4 Abnormal event detection results (in %) in terms of frame-level and pixel-level AUC on the UCSD data set. Our unsupervised framework based on unmasking and our supervised framework based on Narrowed Motion Clusters are compared with several state-of-the-art supervised methods [Cheng et al., 2015; Cong et al., 2011; Hasan et al., 2016; Hinami et al., 2017; Ionescu et al., 2017; Kim & Grauman, 2009; Lu et al., 2013; Mahadevan et al., 2010; Mehran et al., 2009; Ravanbakhsh et al., 2017; Ren et al., 2015; Saligrama & Chen, 2012; Sun et al., 2017; Xu et al., 2015; Zhang et al., 2016], which are listed in temporal order.
5.5 Abnormal event detection results (in %) in terms of frame-level AUC on the UMN data set. Our unsupervised framework based on unmasking and our supervised framework based on Narrowed Motion Clusters are compared with several state-of-the-art methods [Cong et al., 2011; Del Giorno et al., 2016; Mehran et al., 2009; Ravanbakhsh et al., 2017; Saligrama & Chen, 2012; Smeureanu et al., 2017; Sun et al., 2017; Zhang et al., 2016], which are listed in temporal order.
6.1 The 20 mammals from the EMBL database used in the sequence alignment experiments. The accession number is given in the last column.
6.2 The genomic sequence information of three vibrio pathogens consisting of two circular chromosomes.
6.3 Several statistics of the state-of-the-art aligners versus the LRD aligner, when 10,000 contaminated reads of length 100 sampled from the orangutan genome are included. The AUC is computed from the ROC curve, while the best F1 and F2 measures were computed using different points on the precision-recall curve. The F2 measure puts a higher weight on recall.
6.4 Metrics of the human reads mapped to the human mitochondrial genome (true positives) by the hash LRD aligner versus the human reads that are not mapped to the genome (false negatives). The average edit distance is reported for true positive (TP) and false negative (FN) reads, respectively. The average edit distance is given for several points on the precision-recall curve of the hash LRD aligner, going from 100% precision to 100% recall. The points are obtained by varying the LRD threshold from 51 to 539.
6.5 Several statistics of the state-of-the-art aligners versus the LRD aligner, when 50,000 contaminated reads of length 100 sampled from the genomes of 5 mammals are included. The AUC is computed from the ROC curve, while the best F1 and F2 measures were computed using different points on the precision-recall curve. The F2 measure puts a higher weight on recall.
6.6 The recall at best precision of the state-of-the-art aligners versus the LRD aligner, when 10,000 contaminated reads of length 100 sampled from the orangutan genome are included.
6.7 The recall at best precision of the state-of-the-art aligners versus the LRD aligner, when 40,000 contaminated reads of length 100 sampled from the blue whale, the harbor seal, the donkey, and the house mouse genomes are included, respectively.
6.8 The results for the real-world setting experiment on mammals. The results of clustering unknown organisms using the BWA aligner, the BLAST aligner, the BOWTIE aligner and the LRD aligner are presented in columns, respectively. Mammals are labeled with numbers from 1 to 20, given in the second column. The label of the closest species obtained by each aligner is reported for each mammal. Incorrectly clustered mammals are marked in bold and with an asterisk. Classes are actually 3-letter prefixes of order names. Unknown organisms are represented by 20,000 reads of length 100 simulated from the original genomes. Half of the reads are reverse complements.
6.9 The results for the hard setting experiment on mammals. The results of clustering unknown organisms using the BWA aligner, the BLAST aligner, the BOWTIE aligner and the LRD aligner are presented in columns, respectively. Mammals are labeled with numbers from 1 to 20, given in the second column. The label of the closest species obtained by each aligner is reported for each mammal. Incorrectly clustered mammals are marked in bold and with an asterisk. Classes are actually 3-letter prefixes of order names. Unknown organisms are represented by 200 reads of length 100 (half of them being reverse complements) simulated from the original genomes, using an error rate of 0.08 and a mutation rate of 0.008.
6.10 The running times of the BWA aligner, the BLAST aligner, the BOWTIE aligner and the LRD aligner. The aligners are compared on the task of aligning 7,676,000 short DNA reads, each 100 bases long, on a reference mtDNA genome of roughly 15,000-17,000 bases. The aligners were evaluated on a computer with an Intel Core i7 2.3 GHz processor and 8 GB of RAM, using a single core.
6.11 The results of the rank-based aligner on vibrio species. The LRD aligner is based on 3-mers, a maximal offset of 36, and a Local Rank Distance threshold of 1000. The scores obtained by the LRD aligner for simulated reads of V. vulnificus chromosomes I and II aligned into V. parahaemolyticus and V. cholerae are presented in this table. The first column indicates the source chromosome of the simulated reads. The second column indicates the reference chromosome. The third and fourth columns show the scores of the two aligners computed with the evaluation tool provided in the software package.
7.1 Confusion matrix (also known as contingency table) of a binary classifier with labels +1 or −1.
7.2 Empirical results on the Reuters-21578 corpus obtained by the standard bag-of-words versus two methods that encode spatial information, namely spatial pyramids and SNAK. The macro-averaged and the micro-averaged F1 measures are reported for two evaluation modes, one that includes unlabeled documents and one that excludes unlabeled documents. The learning is always done by KRR. The best scores are highlighted in bold. The marker * indicates that the performance is significantly better than the baseline according to a Student's t-test performed at a significance level of 0.01.
7.3 Empirical results on the 20 Newsgroups corpus obtained by the standard bag-of-words versus two methods that encode spatial information, namely spatial pyramids and SNAK. The macro-averaged and the micro-averaged F1 measures are reported for the 4-fold cross-validation procedure. The learning is always done by KRR. The best scores are highlighted in bold. The marker * indicates that the performance is significantly better than the baseline according to a Student's t-test performed at a significance level of 0.01.
8.1 Accuracy rates using 10-fold cross-validation on the Movie Review data set with different kernels and vocabulary dimensions. The best accuracy rate for each vocabulary dimension is highlighted in bold.
8.2 Accuracy rates using 10-fold cross-validation on the Movie Review data set with various BOSWE configurations versus two baseline approaches. The best accuracy rate is highlighted in bold.
8.3 Confusion matrix of a binary classifier with labels +1 or −1. There are four distinct groups of samples illustrated here: true positive (TP), false positive (FP), false negative (FN), and true negative (TN).
8.4 Results on the Reuters-21578 test set with different kernels and vocabulary dimensions. The best micro F1 and macro F1 scores for each vocabulary dimension are highlighted in bold.
8.5 Results on the Reuters-21578 test set with various BOSWE configurations versus a baseline bag-of-words model. The best micro F1 and macro F1 scores are highlighted in bold.
8.6 The number of essays and the score ranges for the 8 different prompts in the Automated Student Assessment Prize (ASAP) data set.
8.7 In-domain automatic essay scoring results of our approach versus several state-of-the-art methods [Dong & Zhang, 2016; Dong et al., 2017; Phandi et al., 2015; Tay et al., 2018]. Results are reported in terms of the quadratic weighted kappa (QWK) measure, using 5-fold cross-validation. The best QWK score (among the machine learning systems) for each prompt is highlighted in bold.
8.8 Cross-domain automatic essay scoring results of our approach versus two state-of-the-art methods [Dong & Zhang, 2016; Phandi et al., 2015]. Results are reported in terms of the quadratic weighted kappa (QWK) measure, using the same evaluation procedure as [Dong & Zhang, 2016; Phandi et al., 2015]. The best QWK scores for each source→target domain pair are highlighted in bold.
9.1 A summary of the number of ambiguous words along with the distribution of ambiguous words per part-of-speech in the four data sets considered in our evaluation.
9.2 The F1 scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the F1 scores of various unsupervised state-of-the-art WSD approaches, on the SemEval 2007 coarse-grained English all-words task. The results reported for ShotgunWSD 1.0 are obtained for windows of n = 8 words and a majority vote on the top k = 15 configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of n = 6 words, a majority vote on the top k = 15 configurations and k-means clustering with 250 clusters.
9.3 The F1 scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the F1 scores of an unsupervised WSD approach and the extended Lesk measure, on the Senseval-2 English all-words data set. The results reported for ShotgunWSD 1.0 are obtained for windows of n = 8 words and a majority vote on the top k = 15 configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of n = 6 words, a majority vote on the top k = 15 configurations and k-means clustering with 250 clusters.
9.4 The F1 scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the F1 scores of an unsupervised WSD approach and the extended Lesk measure, on the Senseval-3 English all-words data set. The results reported for ShotgunWSD 1.0 are obtained for windows of n = 8 words and a majority vote on the top k = 15 configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of n = 6 words, a majority vote on the top k = 15 configurations and k-means clustering with 250 clusters.
9.5 The F1 scores of ShotgunWSD 1.0 and ShotgunWSD 2.0 versus the F1 scores of two state-of-the-art WSD approaches, on the SemEval 2015 English all-words task. The results reported for ShotgunWSD 1.0 are obtained for windows of n = 8 words and a majority vote on the top k = 15 configurations. The results reported for ShotgunWSD 2.0 are obtained for windows of n = 6 words, a majority vote on the top k = 15 configurations and k-means clustering with 250 clusters.